Apparatus, system and method for detecting and correcting erroneous speculative branch target address cache branches

ABSTRACT

An apparatus for detecting erroneous speculative branches made by a pipelined microprocessor and for correcting the erroneous branches. A branch target address cache (BTAC) caches target addresses of executed branch instructions. A speculative branch is performed to a cached target address early in the pipeline based on a hit in the BTAC of an instruction cache fetch address before the instruction is decoded. When the speculative branch is performed, a hit bit is set. Later in the pipeline, the presumed branch instruction is decoded and executed. If the hit bit is set for the instruction, the decoded instruction is examined and the correct target address and direction are compared to the speculative versions to determine if an error was made by speculatively branching. If an error is detected, the branch target address cache is updated or invalidated, and the processor branches to the appropriate address to correct the error.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to the following U.S. Patent applications, having a common filing date and a common assignee. Each of these applications is hereby incorporated by reference in its entirety for all purposes:

Docket #      Serial #      Title
CNTR: 2021                  SPECULATIVE BRANCH TARGET ADDRESS CACHE
CNTR: 2023                  SPECULATIVE HYBRID BRANCH DIRECTION PREDICTOR
CNTR: 2050                  DUAL CALL/RETURN STACK BRANCH PREDICTION SYSTEM
CNTR: 2052                  SPECULATIVE BRANCH TARGET ADDRESS CACHE WITH SELECTIVE OVERRIDE BY SECONDARY PREDICTOR BASED ON BRANCH INSTRUCTION TYPE
CNTR: 2062                  APPARATUS AND METHOD FOR SELECTING ONE OF MULTIPLE TARGET ADDRESSES STORED IN A SPECULATIVE BRANCH TARGET ADDRESS CACHE PER INSTRUCTION CACHE LINE
CNTR: 2063                  APPARATUS AND METHOD FOR TARGET ADDRESS REPLACEMENT IN SPECULATIVE BRANCH TARGET ADDRESS CACHE

FIELD OF THE INVENTION

[0002] This invention relates in general to the field of branch prediction in microprocessors, and more particularly to branch target address caching.

BACKGROUND OF THE INVENTION

[0003] Computer instructions are typically stored in successive addressable locations within a memory. When processed by a Central Processing Unit (CPU), or processor, the instructions are fetched from consecutive memory locations and executed. Each time an instruction is fetched from memory, a program counter (PC), or instruction pointer (IP), within the CPU is incremented so that it contains the address of the next instruction in the sequence. This is the next sequential instruction pointer, or NSIP. Fetching of an instruction, incrementing of the program counter, and execution of the instruction continues linearly through memory until a program control instruction is encountered.

[0004] A program control instruction, also referred to as a branch instruction, when executed, changes the address in the program counter and causes the flow of control to be altered. In other words, branch instructions specify conditions for altering the contents of the program counter. The change in the value of the program counter because of the execution of a branch instruction causes a break in the sequence of instruction execution. This is an important feature in digital computers, as it provides control over the flow of program execution and a capability for branching to different portions of a program. Examples of program control instructions include jump, conditional jump, call, and return.

[0005] A jump instruction causes the CPU to unconditionally change the contents of the program counter to a specific value, i.e., to the target address for the instruction where the program is to continue execution. A conditional jump causes the CPU to test the contents of a status register, or possibly compare two values, and either continue sequential execution or jump to a new address, called the target address, based on the outcome of the test or comparison. A call instruction causes the CPU to unconditionally jump to a new target address, but also saves the value of the program counter to allow the CPU to return to the program location it is leaving. A return instruction causes the CPU to retrieve the value of the program counter that was saved by the last call instruction, and return program flow back to the retrieved instruction address.

[0006] In early microprocessors, execution of program control instructions did not impose significant processing delays because such microprocessors were designed to execute only one instruction at a time. If the instruction being executed was a program control instruction, by the end of execution the microprocessor would know whether it should branch, and if it was supposed to branch, it would know the target address of the branch. Thus, whether the next instruction was sequential, or the result of a branch, it would be fetched and executed.

[0007] Modern microprocessors are not so simple. Rather, it is common for modern microprocessors to operate on several instructions at the same time, within different blocks or pipeline stages of the microprocessor. Hennessy and Patterson define pipelining as, “an implementation technique whereby multiple instructions are overlapped in execution.” Computer Architecture: A Quantitative Approach, 2nd edition, by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, Calif. 1996. The authors go on to provide the following excellent illustration of pipelining:

[0008] “A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of the different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.”

[0009] Thus, as instructions are fetched, they are introduced into one end of the pipeline. They proceed through pipeline stages within a microprocessor until they complete execution. In such pipelined microprocessors, it is often not known whether a branch instruction will alter program flow until it reaches a late stage in the pipeline. However, by this time, the microprocessor has already fetched other instructions and is executing them in earlier stages of the pipeline. If a branch instruction causes a change in program flow, all of the instructions in the pipeline that followed the branch instruction must be thrown out. In addition, the instruction specified by the target address of the branch instruction must be fetched. Throwing out the intermediate instructions and fetching the instruction at the target address creates processing delays in such microprocessors, referred to as a branch penalty.

[0010] To alleviate this delay problem, many pipelined microprocessors use branch prediction mechanisms in an early stage of the pipeline that make predictions of branch instructions. The branch prediction mechanisms predict the outcome, or direction, of the branch instruction, i.e., whether the branch will be taken or not taken. The branch prediction mechanisms also predict the branch target address of the branch instruction, i.e., the address of the instruction that will be branched to by the branch instruction. The processor then branches to the predicted branch target address, i.e., fetches subsequent instructions according to the branch prediction, sooner than it would without the branch prediction, thereby potentially reducing the penalty if the branch is taken.

[0011] A branch prediction mechanism that caches target addresses of previously executed branch instructions is referred to as a branch target address cache (BTAC), or branch target buffer (BTB). In a simple BTAC or BTB, when the processor decodes a branch instruction, the processor provides the branch instruction address to the BTAC. If the address generates a hit in the BTAC and the branch is predicted taken, then the processor may use the cached target address from the BTAC to begin fetching instructions at the target address, rather than at the next sequential instruction address.
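The simple BTAC lookup described above may be sketched in software as follows. This is only an illustrative model under stated assumptions, not a description of any actual hardware: the names `BtacEntry`, `SimpleBtac`, and `next_fetch_address` are hypothetical, and the cache is modeled as an unbounded table keyed by the full branch instruction address.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BtacEntry:
    target: int          # cached target address of a previously executed branch
    predict_taken: bool  # cached direction prediction (taken / not taken)


class SimpleBtac:
    """Toy BTAC keyed by the full branch instruction address (an assumption)."""

    def __init__(self) -> None:
        self.entries: Dict[int, BtacEntry] = {}

    def update(self, branch_addr: int, target: int, taken: bool) -> None:
        # Called after a branch executes, so that later lookups can hit.
        self.entries[branch_addr] = BtacEntry(target, taken)

    def lookup(self, branch_addr: int) -> Optional[BtacEntry]:
        # Returns None on a miss.
        return self.entries.get(branch_addr)


def next_fetch_address(btac: SimpleBtac, branch_addr: int, next_sequential: int) -> int:
    """Use the cached target on a hit that is predicted taken; otherwise
    continue at the next sequential instruction address."""
    entry = btac.lookup(branch_addr)
    if entry is not None and entry.predict_taken:
        return entry.target
    return next_sequential
```

In this sketch a miss and a hit that is predicted not taken are handled identically: fetching continues sequentially.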

[0012] The benefit of the BTAC over a predictor that merely predicts taken/not taken, such as a branch history table (BHT), is that the BTAC saves the time needed to calculate the target address beyond the time needed to determine that a branch instruction has been encountered. Typically, branch prediction information (e.g., taken/not taken) is stored in the BTAC along with the target address. A BTAC is historically employed at the instruction decode stages of the pipeline. This is because the processor must first determine that a branch instruction is present.

[0013] An example of a processor that employs a BTB is the Intel® Pentium® II and III processor. Referring now to FIG. 1, a block diagram of relevant portions of a Pentium II/III processor 100 is shown. The processor 100 includes a BTB 134 that caches branch target addresses. The processor 100 fetches instructions from an instruction cache 102 that caches instructions 108 and pre-decoded branch prediction information 104. The pre-decoded branch prediction information 104 may include information such as an instruction type or an instruction length. Instructions are fetched from the instruction cache 102 and provided to instruction decode logic 132 that decodes, or translates, instructions.

[0014] Typically, instructions are fetched from a next sequential fetch address 112, which is simply the current instruction cache 102 fetch address 122 incremented by the size of an instruction cache 102 line by an incrementer 118. However, if a branch instruction is decoded by the instruction decode logic 132, then control logic 114 selectively controls a multiplexer 116 to select the branch target address 136 supplied by the BTB 134 as the fetch address 122 for the instruction cache 102 rather than selecting the next sequential fetch address 112. The control logic 114 selects the instruction cache 102 fetch address 122 based on the pre-decode information 104 from the instruction cache 102 and whether the BTB 134 predicts the branch instruction will be taken or not taken based on an instruction pointer 138 used to index the BTB 134.
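The fetch address selection of FIG. 1 may be modeled, purely for illustration, by the following sketch. The function name `select_fetch_address`, the boolean inputs, and the 32-byte line size are assumptions introduced here; the pre-decode information 104 is reduced to a single "branch decoded" flag.

```python
LINE_SIZE = 32  # bytes per instruction cache line; an assumed value


def select_fetch_address(current_fetch: int,
                         branch_decoded: bool,
                         btb_predicts_taken: bool,
                         btb_target: int,
                         line_size: int = LINE_SIZE) -> int:
    """Model of multiplexer 116 under control logic 114 (simplified)."""
    # Incrementer 118: current fetch address plus one cache line.
    next_sequential = current_fetch + line_size
    # The BTB target is selected only once a branch has been decoded
    # and the BTB predicts that the branch will be taken.
    if branch_decoded and btb_predicts_taken:
        return btb_target
    return next_sequential
```

The point of the sketch is that the target address is selected only after decode has confirmed a branch, which is the property the speculative BTAC described later gives up in exchange for earlier branching.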

[0015] Rather than indexing the BTB 134 with the instruction pointer of the branch instruction itself, the Pentium II/III indexes the BTB 134 with the instruction pointer 138 of an instruction prior to the branch instruction being predicted. This enables the BTB 134 to look up the target address 136 while the branch instruction is being decoded. Otherwise, the processor 100 would incur an additional branch penalty delay, since it would have to wait until after the branch instruction is decoded to perform the BTB 134 lookup. Presumably, the processor 100 branches to the target address 136 provided by the BTB 134 based on the instruction pointer 138 index only once the branch instruction has been decoded by the instruction decode logic 132 and the processor 100 knows with certainty that a branch instruction is present.

[0016] Another example of a processor that employs a BTAC is the AMD® Athlon® processor. Referring now to FIG. 2, a block diagram of relevant portions of an Athlon processor 200 is shown. The processor 200 includes elements similar to those of the Pentium II/III of FIG. 1, which are similarly labeled. The Athlon processor 200 integrates its BTAC into its instruction cache 202. That is, the instruction cache 202 caches branch target addresses 206 in addition to instruction data 108 and pre-decoded branch prediction information 104. For each instruction byte pair, the instruction cache 202 reserves two bits for predicting the direction of the branch instruction. The instruction cache 202 reserves space for two branch target addresses per 16 bytes' worth of instructions in a line of the instruction cache 202.

[0017] As may be observed from FIG. 2, the instruction cache 202 is indexed by a fetch address 122. The BTAC is also indexed by the fetch address 122 because the BTAC is integrated into the instruction cache 202. Consequently, if a hit occurs for a line in the instruction cache 202, there is certainty that the cached branch target address 206 corresponds to a branch instruction existent in the indexed instruction cache 202 line.

[0018] Although the prior methods provide branch prediction improvements, there are disadvantages to the prior methods. A disadvantage of both of the prior methods discussed above is that the instruction pre-decode information, and in the case of Athlon the branch target addresses, substantially increase the size of the instruction cache. It has been speculated that for Athlon the branch prediction information essentially doubles the size of the instruction cache. Additionally, the Pentium II/III BTB stores a relatively large amount of branch history information per branch instruction for predicting the branch direction, thereby increasing the size of the BTB.

[0019] A disadvantage of the Athlon integrated BTAC is that the integration of the BTAC into the instruction cache causes space usage inefficiency. That is, the integrated instruction cache/BTAC occupies storage space for caching branch instruction information for non-branch instructions as well as branch instructions. Much of the space taken up inside the Athlon instruction cache by the additional branch prediction information is wasted since the instruction cache has a relatively low concentration of branch instructions. For example, a given instruction cache line may have no branches in it, and thus all the space taken up by storing the target addresses and other branch prediction information in the line is unused and wasted.

[0020] Another disadvantage of the Athlon integrated BTAC is that of conflicting design goals. That is, the instruction cache size may be dictated by design goals that are different from the design goals of the branch prediction mechanism. Requiring the BTAC to be the same size as the instruction cache, in terms of cache lines, which is inherent in the Athlon scheme, may not optimally meet both sets of design goals. For example, the instruction cache size may be chosen to achieve a certain cache-hit ratio. However, it may be that the required branch target address prediction rate might have been achieved with a smaller BTAC.

[0021] Furthermore, because the BTAC is integrated with the instruction cache, the data access time to obtain the cached branch target address is by necessity the same as the access time of the cached instruction bytes. In the case of the relatively large Athlon instruction cache, the access time may be relatively long. A smaller, non-integrated BTAC might have a data access time substantially less than the access time of the integrated instruction cache/BTAC.

[0022] The Pentium II/III method does not suffer many of the Athlon integrated instruction cache/BTAC problems mentioned since the Pentium II/III BTB is not integrated with the instruction cache. However, because the Pentium II/III BTB is indexed with the instruction pointer of an already decoded instruction, rather than the instruction cache fetch address, the Pentium II/III solution potentially may not be able to branch as early as the Athlon solution, and therefore, may not reduce the branch penalty as effectively. The Pentium II/III solution potentially addresses this problem by indexing the BTB with the instruction pointer of a previous instruction, or previous instruction group, rather than the actual branch instruction pointer, as mentioned above.

[0023] However, a disadvantage of the Pentium II/III method is that some amount of branch prediction accuracy is sacrificed by using the instruction pointer of a previous instruction, rather than the actual branch instruction pointer. The reduction in accuracy is due, in part, to the fact that the branch instruction may be reached via multiple instruction paths in the program. That is, instruction pointers of multiple previous instructions to the branch instruction may be cached in the BTB for the same branch instruction. Consequently, multiple entries must be consumed in the BTB for such a branch instruction, thereby reducing the overall number of branch instructions that may be cached in the BTB. The greater the number of instructions previous to the branch instruction used, the greater the number of paths by which the branch instruction may be reached.
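The entry-consumption effect described above can be illustrated with a small model. The addresses and the helper name `fill_btb` are hypothetical; the BTB is modeled as a simple table, and a single branch instruction is assumed to be reachable from three different prior instructions.

```python
from typing import Dict, List, Tuple


def fill_btb(paths: List[Tuple[int, int, int]], index_by_prior: bool) -> Dict[int, int]:
    """Each path is (prior_ip, branch_ip, target). Returns index -> target."""
    btb: Dict[int, int] = {}
    for prior_ip, branch_ip, target in paths:
        # Index either by the prior instruction pointer or by the
        # instruction pointer of the branch instruction itself.
        key = prior_ip if index_by_prior else branch_ip
        btb[key] = target
    return btb


# One branch at 0x300 (target 0x900), reached via three different paths.
paths = [
    (0x100, 0x300, 0x900),
    (0x200, 0x300, 0x900),
    (0x280, 0x300, 0x900),
]
entries_prior = len(fill_btb(paths, index_by_prior=True))
entries_branch = len(fill_btb(paths, index_by_prior=False))
```

In this model, indexing by the prior instruction pointer consumes one entry per path, whereas indexing by the branch instruction pointer consumes a single entry for the same branch.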

[0024] Additionally, because using a prior instruction pointer introduces the possibility of multiple paths to the same branch instruction, it potentially takes the Pentium II/III direction predictor in the BTB longer to “warm up”. The Pentium II/III BTB maintains branch history information for predicting the direction of the branch. When a new branch instruction is brought into the processor and cached, the multiple paths to the branch instruction potentially cause the branch history to become updated more slowly than would be the case if only a single path to the branch instruction were possible, resulting in less accurate predictions.

[0025] Therefore, what is needed is a branch prediction apparatus that makes efficient use of chip real estate, but also provides accurate branching early in the pipeline to reduce branch penalty.

SUMMARY

[0026] The present invention provides a branch prediction method and apparatus that makes efficient use of chip real estate, but also provides accurate branching early in the pipeline to reduce branch penalty. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide an apparatus in a microprocessor for detecting that the microprocessor erroneously branched to a speculative target address that is provided by a branch target address cache (BTAC). The apparatus includes a storage element that stores an indication of whether the microprocessor branched to the speculative target address provided by the BTAC without knowing whether an instruction associated with the indication is a branch instruction. The apparatus also includes instruction decode logic that receives and decodes the instruction subsequent to the microprocessor branching to the speculative target address. The apparatus also includes prediction check logic, coupled to the instruction decode logic, that notifies branch control logic that the microprocessor erroneously branched to the speculative target address if the instruction decode logic indicates the instruction is not a branch instruction and the indication indicates the microprocessor branched to the speculative target address.
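The condition evaluated by the prediction check logic in this aspect may be expressed as a short sketch. The names `FetchedInstruction`, `spec_branched`, and `prediction_check` are illustrative assumptions; the stored indication is modeled as a boolean that travels with the instruction to the decode stage.

```python
from dataclasses import dataclass


@dataclass
class FetchedInstruction:
    ip: int               # instruction pointer of the instruction
    spec_branched: bool   # stored indication: the processor branched to the
                          # speculative target address on a BTAC hit


def prediction_check(instr: FetchedInstruction, decoded_is_branch: bool) -> bool:
    """Return True when branch control logic should be notified that the
    microprocessor erroneously branched to the speculative target address."""
    return instr.spec_branched and not decoded_is_branch
```

Only the combination of a set indication and a decoded non-branch instruction signals this particular error; a decoded branch with a set indication still requires the separate target, length, and direction checks described in the later aspects.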

[0027] In another aspect, it is a feature of the present invention to provide an apparatus in a microprocessor for detecting that the microprocessor erroneously speculatively branched to a target address that is provided by a speculative branch target address cache (BTAC). The apparatus includes a storage element that stores an indication of whether the microprocessor speculatively branched to the target address provided by the BTAC based on an instruction cache fetch address without first determining whether a branch instruction is present in a line of instruction bytes in the instruction cache selected by the fetch address. The apparatus also includes instruction decode logic that receives and decodes the instruction bytes in the instruction cache line subsequent to the microprocessor speculatively branching to the target address. The instruction decode logic indicates whether the line includes a branch instruction. The apparatus also includes prediction check logic, coupled to the instruction decode logic, that provides an error signal to branch control logic if the indication indicates the microprocessor speculatively branched to the target address and the instruction decode logic indicates the line does not include a branch instruction.

[0028] In another aspect, it is a feature of the present invention to provide a microprocessor for detecting and correcting an erroneous speculative branch. The microprocessor includes an instruction cache that provides a line of instruction bytes selected by a fetch address. The fetch address is provided to the instruction cache on an address bus. The microprocessor also includes a speculative branch target address cache (BTAC), coupled to the address bus, that provides a speculative target address of a previously executed branch instruction in response to the fetch address whether or not the previously executed branch instruction is present in the line. The microprocessor also includes control logic, coupled to the BTAC, that controls a multiplexer to select the speculative target address as the fetch address during a first period. The microprocessor also includes prediction check logic, coupled to the BTAC, that detects that the control logic controlled the multiplexer to select the speculative target address erroneously. The control logic is further configured to control the multiplexer to select a correct address as the fetch address during a second period in response to the prediction check logic detecting the erroneous selection.

[0029] In another aspect, it is a feature of the present invention to provide a method for recovering from an erroneous branch to a speculative target address. The method includes generating a speculative target address for a branch instruction that is presumed present in an instruction cache line selected by a fetch address, branching to the speculative target address whether or not the presumed branch instruction is present in the instruction cache line, and generating a correct target address of the presumed branch instruction subsequent to the generating the speculative target address. The method also includes determining if the speculative target address matches the correct target address, and branching to the correct target address if the speculative target address does not match the correct target address.

[0030] In another aspect, it is a feature of the present invention to provide a method for recovering from an erroneous branch to a speculative target address for a branch instruction, the branch instruction being presumably present in a line of instructions, the line of instructions being provided by an instruction cache in response to a fetch address, the speculative target address being speculatively generated by a branch target address cache (BTAC) in response to the fetch address. The method includes decoding the presumed branch instruction subsequent to the BTAC speculatively generating the speculative target address, determining if the presumed branch instruction is a non-branch instruction in response to the decoding, and branching to an instruction pointer of the presumed branch instruction if the presumed branch instruction is a non-branch instruction.

[0031] In another aspect, it is a feature of the present invention to provide a method for recovering from an erroneous branch to a speculative target address, the speculative target address being associated with a branch instruction that is presumably present in a cache line selected by a fetch address, the speculative target address being provided by a branch target address cache (BTAC) in response to the fetch address. The method includes decoding the presumed branch instruction subsequent to the BTAC providing the speculative target address, determining a length of the presumed branch instruction, and branching to an instruction pointer of the presumed branch instruction if the length of the presumed branch instruction does not match an instruction length speculatively provided by the branch target address cache.

[0032] In another aspect, it is a feature of the present invention to provide a method for recovering from an erroneous branch to a speculative target address. The method includes generating a speculative target address of a branch instruction that is presumed present in an instruction cache line selected by a fetch address, generating a speculative direction prediction of the presumed branch instruction, and branching to the speculative target address whether or not the presumed branch instruction is present in the instruction cache line. The method also includes generating a correct direction of the presumed branch instruction subsequent to the generating the speculative direction prediction, determining if the correct direction is not taken, and branching to an instruction pointer of a next instruction after the presumed branch instruction if the correct direction is not taken.
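The recovery methods of the four preceding aspects (target mismatch, non-branch instruction, length mismatch, and not-taken direction) may be combined into one illustrative decision function. This is a sketch under assumptions: the names `Speculation`, `Decoded`, and `correction_address` are hypothetical, the order in which the checks are applied is chosen here for illustration only, and the next instruction pointer is assumed to be the instruction pointer plus the decoded instruction length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Speculation:
    target: int   # speculative target address provided by the BTAC
    length: int   # instruction length speculatively provided by the BTAC


@dataclass
class Decoded:
    ip: int          # instruction pointer of the presumed branch instruction
    is_branch: bool  # result of decoding
    length: int      # decoded instruction length
    taken: bool      # correct direction
    target: int      # correct target address


def correction_address(spec: Speculation, dec: Decoded) -> Optional[int]:
    """Return the address to branch to in order to correct an erroneous
    speculative branch, or None if the speculative branch was correct."""
    if not dec.is_branch:
        return dec.ip                 # non-branch: branch to its instruction pointer
    if dec.length != spec.length:
        return dec.ip                 # length mismatch: branch to its instruction pointer
    if not dec.taken:
        return dec.ip + dec.length    # not taken: branch to the next instruction
    if dec.target != spec.target:
        return dec.target             # wrong target: branch to the correct target
    return None
```

The function assumes it is called only for instructions whose speculative-branch indication is set; instructions fetched without a speculative branch need none of these checks.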

[0033] In another aspect, it is a feature of the present invention to provide an apparatus in a microprocessor for detecting an erroneous branch to a speculative return address that is provided by a speculative call/return stack. The apparatus includes a storage element that stores an indication of whether the microprocessor branched to the speculative return address without knowing whether or not an instruction associated with the indication is a branch instruction. The apparatus also includes instruction decode logic that receives and decodes the instruction subsequent to the microprocessor branching to the speculative return address. The apparatus also includes prediction check logic, coupled to the instruction decode logic, that notifies branch control logic that the microprocessor erroneously branched to the speculative return address if the instruction decode logic indicates that the instruction is not a branch instruction and the indication indicates that the microprocessor branched to the speculative return address.

[0034] In another aspect, it is a feature of the present invention to provide a microprocessor for detecting and correcting an erroneous speculative branch. The microprocessor includes an instruction cache that provides a line of instruction bytes selected by a fetch address. The microprocessor also includes a speculative call/return stack that provides a speculative return address of a previously executed branch instruction in response to the fetch address. The speculative return address is provided whether or not the previously executed branch instruction is present in the line of instruction bytes. The microprocessor also includes control logic, coupled to the speculative call/return stack, that controls a multiplexer to select the speculative return address to be the fetch address during a first period. The microprocessor also includes prediction check logic, coupled to the control logic, that detects that the control logic controlled the multiplexer to select the speculative return address erroneously. The control logic also controls the multiplexer to select a correct address to be the fetch address during a second period. The control logic selects the correct address in response to the prediction check logic detecting that the control logic controlled the multiplexer to select the speculative return address erroneously.

[0035] In another aspect, it is a feature of the present invention to provide a method in a microprocessor for recovering from an erroneous branch to a speculative target address of a presumed branch instruction. The method includes providing a speculative target address in response to an instruction cache fetch address, producing an instruction cache line in response to the instruction cache fetch address, and decoding an instruction from the instruction cache line subsequent to the providing the speculative target address. The decoding is performed for a first time by the microprocessor for the instruction. The method also includes branching to the speculative target address prior to the decoding, and branching to a correct target address of the instruction subsequent to the branching to the speculative target address in response to the decoding.

[0036] In another aspect, it is a feature of the present invention to provide a method for recovering from an erroneous branch to a speculative target address. The method includes providing a speculative target address for a branch instruction that is presumed present in an instruction cache line that is selected by a fetch address, branching to the speculative target address whether or not the presumed branch instruction is present in the instruction cache line, and correcting from an erroneous branch if the presumed branch instruction is not present in the instruction cache line.

[0037] In another aspect, it is a feature of the present invention to provide a branch apparatus in a microprocessor for detecting when the microprocessor erroneously branches to a speculative target address, the speculative target address being provided by a branch target address cache (BTAC). The apparatus includes a branch hit indicator, provided to indicate when the microprocessor branches to the speculative target address. The branch hit indicator is provided whether or not an instruction associated with the branch hit indicator is a branch instruction. The apparatus also includes instruction decode logic that receives and decodes the instruction and specifies whether the instruction is a branch instruction. The apparatus also includes prediction check logic, coupled to the instruction decode logic, that determines that the microprocessor erroneously branched to the speculative target address. The microprocessor erroneously branched to the speculative target address when the instruction decode logic specifies that the instruction is not a branch instruction, and the branch hit indicator indicates that the microprocessor branched to the speculative target address.

[0038] An advantage of the present invention is that it ensures proper program execution in a processor that employs a speculative BTAC, which has the potential advantages of more efficient use of integrated circuit real estate, improved processor cycle time and/or reduced processor clocks per instruction, and improved likelihood of single-cycle BTAC cache realization.

[0039] Other features and advantages of the present invention will become apparent upon study of the remaining portions of the specification and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] FIG. 1 is a prior art block diagram of relevant portions of a Pentium II/III processor.

[0041] FIG. 2 is a prior art block diagram of relevant portions of an Athlon processor.

[0042] FIG. 3 is a block diagram illustrating a pipelined microprocessor according to the present invention.

[0043] FIG. 4 is a speculative branch prediction apparatus of the processor of FIG. 3 according to the present invention.

[0044] FIG. 5 is a block diagram of the instruction cache of FIG. 4.

[0045] FIG. 6 is a block diagram of the branch target address cache (BTAC) of FIG. 4 according to the present invention.

[0046] FIG. 7 is a block diagram of the format of an entry of FIG. 6 of the BTAC of FIG. 4 according to the present invention.

[0047] FIG. 8 is a flowchart illustrating operation of the speculative branch prediction apparatus of FIG. 4 according to the present invention.

[0048] FIG. 9 is a block diagram illustrating an example of operation of the speculative branch prediction apparatus of FIG. 4 using the steps of FIG. 8 to select a target address according to the present invention.

[0049] FIG. 10 is a flowchart illustrating operation of the speculative branch prediction apparatus of FIG. 4 to detect and correct erroneous speculative branch predictions according to the present invention.

[0050] FIG. 11 is sample code fragments and a table illustrating an example of the speculative branch misprediction detection and correction of FIG. 10 according to the present invention.

[0051] FIG. 12 is a block diagram illustrating an alternate embodiment of the branch prediction apparatus of FIG. 4 including a hybrid speculative branch direction predictor according to the present invention.

[0052] FIG. 13 is a flowchart illustrating operation of the dual call/return stacks of FIG. 4.

[0053] FIG. 14 is a flowchart illustrating operation of the branch prediction apparatus of FIG. 4 to selectively override speculative branch predictions with non-speculative branch predictions thereby improving the branch prediction accuracy of the present invention.

[0054] FIG. 15 is a block diagram illustrating an apparatus for replacing a target address in the BTAC of FIG. 4 according to the present invention.

[0055] FIG. 16 is a flowchart illustrating a method of operation of the apparatus of FIG. 15 according to the present invention.

[0056] FIG. 17 is a flowchart illustrating a method of operation of the apparatus of FIG. 15 according to an alternate embodiment of the present invention.

[0057] FIG. 18 is a block diagram illustrating an apparatus for replacing a target address in the BTAC of FIG. 4 according to an alternate embodiment of the present invention.

[0058] FIG. 19 is a block diagram illustrating an apparatus for replacing a target address in the BTAC of FIG. 4 according to an alternate embodiment of the present invention.

DETAILED DESCRIPTION

[0059] Referring now to FIG. 3, a block diagram illustrating a pipelined microprocessor 300 according to the present invention is shown. The processor pipeline 300 includes a plurality of stages 302 through 332.

[0060] The first stage is the I-stage 302, or instruction fetch stage. The I-stage 302 is the stage where the processor 300 provides a fetch address to an instruction cache 432 (see FIG. 4) in order to fetch instructions for the processor 300 to execute. The instruction cache 432 is described in more detail with respect to FIG. 4. In one embodiment, the instruction cache 432 is a two-cycle cache. A B-stage 304 is the second stage of the instruction cache 432 access. The instruction cache 432 provides its data to a U-stage 306, where the data is latched in. The U-stage 306 provides the instruction cache data to a V-stage 308.

[0061] In the present invention, the processor 300 further comprises a speculative branch target address cache (BTAC) 402 (see FIG. 4), described in detail with respect to the remaining Figures. The BTAC 402 is not integrated with the instruction cache 432. However, the BTAC 402 is accessed in parallel with the instruction cache 432 in the I-stage 302 using the instruction cache 432 fetch address 495 (see FIG. 4), thereby enabling relatively fast branching to reduce branch penalty. The BTAC 402 provides a speculative branch target address 352 that is provided to the I-stage 302. The processor 300 selectively chooses the target address 352 as the instruction cache 432 fetch address to achieve a branch to the speculative target address 352, as described in detail with respect to the remaining Figures.

[0062] Advantageously, as may be seen from FIG. 3, the branch target address 352 supplied by the branch target address cache 402 in the I-stage 302 enables the processor 300 to branch relatively early in the pipeline 300, creating only a two-cycle instruction bubble. That is, if the processor 300 branches to the speculative target address 352, only two stages worth of instructions must be flushed. In other words, within two cycles, the target instructions of the branch will be available at the U-stage 306 in the typical case, i.e., if the target instructions are present in the instruction cache 432.

[0063] Advantageously, in most cases, the two-cycle bubble is small enough that it may be absorbed by an instruction buffer 342, F-stage instruction queue 344 and/or X-stage instruction queue 346, described below. Consequently, in many cases, the speculative BTAC 402 enables the processor 300 to achieve zero-penalty branches.

[0064] The processor 300 further comprises a speculative call/return stack 406 (see FIG. 4), described in detail with respect to FIGS. 4, 8, and 13. The speculative call/return stack 406 works in conjunction with the speculative BTAC 402 to generate a speculative return address 353, i.e., a target address of a return instruction that is provided to the I-stage 302. The processor 300 selectively chooses the speculative return address 353 as the instruction cache 432 fetch address to achieve a branch to the speculative return address 353, as described in detail with respect to FIG. 8.

[0065] The V-stage 308 is the stage in which instructions are written to the instruction buffer 342. The instruction buffer 342 buffers instructions for provision to an F-stage 312. The V-stage 308 also includes decode logic for providing information about the instruction bytes to the instruction buffer 342, such as x86 prefix and mod R/M information, and whether an instruction byte is a branch opcode value.

[0066] The F-stage 312, or instruction format stage 312, includes instruction format and decode logic 436 (see FIG. 4) for formatting instructions. Preferably, the processor 300 is an x86 processor, which allows for variable length instructions in its instruction set. The instruction format logic 436 receives a stream of instruction bytes from the instruction buffer 342 and parses the stream into discrete groups of bytes constituting an x86 instruction, and in particular providing the length of each instruction.

[0067] The F-stage 312 also includes branch instruction target address calculation logic 416 (see FIG. 4) for generating a non-speculative branch target address 354 based on an instruction decode, rather than based speculatively on the instruction cache 432 fetch address, like the BTAC 402 in the I-stage 302. The F-stage 312 also includes a call/return stack 414 (see FIG. 4) for generating a non-speculative return address 355 based on an instruction decode, rather than based speculatively on the instruction cache 432 fetch address, like the I-stage 302 branch target address cache 402. The F-stage 312 non-speculative addresses 354 and 355 are provided to the I-stage 302. The processor 300 selectively chooses the F-stage 312 non-speculative address 354 or 355 as the instruction cache 432 fetch address to achieve a branch to one of the addresses 354 or 355, as described in detail below.

[0068] An F-stage instruction queue 344 receives the formatted instructions. Formatted instructions are provided by the F-stage instruction queue 344 to an instruction translator in the X-stage 314.

[0069] The X-stage 314, or translation stage 314, instruction translator translates x86 macroinstructions into microinstructions that are executable by the remainder of the pipeline stages. The translated microinstructions are provided by the X-stage 314 to an X-stage instruction queue 346.

[0070] The X-stage instruction queue 346 provides translated microinstructions to an R-stage 316, or register stage 316. The R-stage 316 includes the user-visible x86 register set, in addition to other non-user-visible registers. Instruction operands for the translated microinstructions are stored in the R-stage 316 registers for execution of the microinstructions by subsequent stages of the pipeline 300.

[0071] An A-stage 318, or address stage 318, includes address generation logic that receives operands and microinstructions from the R-stage 316 and generates addresses required by the microinstructions, such as memory addresses for load/store microinstructions.

[0072] A D-stage 322, or data stage 322, includes logic for accessing data specified by the addresses generated by the A-stage 318. In particular, the D-stage 322 includes a data cache for caching data within the processor 300 from a system memory. In one embodiment, the data cache is a two-cycle cache. A G-stage 324 is the second stage of the data cache access, and the data cache data is available in an E-stage 326.

[0073] The E-stage 326, or execution stage 326, includes execution logic, such as arithmetic logic units, for executing the microinstructions based on the data and operands provided from previous stages. In particular, the E-stage 326 produces a resolved target address 356 of all branch instructions. That is, the E-stage 326 target address 356 is known to be the correct target address of all branch instructions with which all predicted target addresses must match. In addition, the E-stage 326 produces a resolved direction (DIR) 481 (see FIG. 4) for all branch instructions.

[0074] An S-stage 328, or store stage 328, performs a store to memory of the results of the microinstruction execution received from the E-stage 326. In addition, the target address 356 of branch instructions calculated in the E-stage 326 is provided to the instruction cache 432 in the I-stage 302 from the S-stage 328. Furthermore, the BTAC 402 of the I-stage 302 is updated from the S-stage 328 with the resolved target addresses of branch instructions executed by the pipeline 300 for caching in the BTAC 402. In addition, other speculative branch information (SBI) 454 (see FIG. 4) is updated in the BTAC 402 from the S-stage 328. The speculative branch information 454 includes the branch instruction length, the location within an instruction cache 432 line of the branch instruction, whether the branch instruction wraps over multiple instruction cache 432 lines, whether the branch is a call or return instruction, and information used to predict the direction of the branch instruction, as described with respect to FIG. 7.

[0075] A W-stage 332, or write-back stage 332, writes back the result from the S-stage 328 into the R-stage 316 registers, thereby updating the processor 300 state.

[0076] The instruction buffer 342, F-stage instruction queue 344 and X-stage instruction queue 346, among other things, serve to minimize the impact of branches upon the clocks per instruction value of the processor 300.

[0077] Referring now to FIG. 4, a speculative branch prediction apparatus 400 of the processor 300 of FIG. 3 according to the present invention is shown. The processor 300 includes an instruction cache 432 for caching instruction bytes 496 from a system memory. The instruction cache 432 is addressed with a fetch address 495 provided on a fetch address bus for indexing a line within the instruction cache 432. Preferably, the fetch address 495 comprises a 32-bit virtual address. That is, the fetch address 495 is not a physical memory address of an instruction. In one embodiment, the virtual fetch address 495 is an x86 linear instruction pointer. In one embodiment, the instruction cache 432 is 32 bytes wide; hence, only the upper 27 bits of the fetch address 495 are used to index the instruction cache 432. A selected cache line 494 of instruction bytes is provided on an output of the instruction cache 432. The instruction cache 432 is described in more detail with respect to FIG. 5 presently.
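By way of illustration only, the relationship between the 32-byte line width and the upper 27 bits of the fetch address 495 may be sketched as follows. The function name and constants are hypothetical and are chosen here for exposition; they do not appear in the Figures.

```python
# Illustrative sketch only. Assumes the one embodiment described above:
# a 32-bit virtual fetch address and 32-byte instruction cache lines.
# All names are hypothetical.
LINE_BYTES = 32
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 5 bits select a byte within a line


def split_fetch_address(fetch_address):
    """Return (line_address, byte_offset) for a 32-bit fetch address."""
    fetch_address &= 0xFFFFFFFF
    line_address = fetch_address >> OFFSET_BITS      # upper 27 bits
    byte_offset = fetch_address & (LINE_BYTES - 1)   # lower 5 bits
    return line_address, byte_offset
```

The lower 5 bits locate a byte within the selected line and therefore do not participate in selecting the line itself.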

[0078] Referring now to FIG. 5, a block diagram of one embodiment of the instruction cache 432 of FIG. 4 is shown. The instruction cache 432 includes logic (not shown) for translating the virtual fetch address 495 of FIG. 4 to a physical address. The instruction cache 432 includes a translation lookaside buffer (TLB) 502 for caching physical addresses previously translated from virtual fetch addresses 495 by the translation logic. In one embodiment, the TLB 502 receives bits [31:12] of the virtual fetch address 495 and provides on its output a corresponding 20-bit physical page number 512 when the virtual fetch address 495 hits in the TLB 502.

[0079] The instruction cache 432 includes a data array 506 for caching instruction bytes. The data array 506 is arranged as a plurality of lines indexed by a portion of the virtual fetch address 495. In one embodiment, the data array 506 stores 64 KB of instruction bytes arranged in 32-byte lines. In one embodiment, the instruction cache 432 is a 4-way set associative cache. Hence, the data array 506 comprises 512 lines of instruction bytes indexed by bits [13:5] of the fetch address 495.
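The figures recited for this embodiment can be checked arithmetically. The following sketch assumes that the 512 lines refer to the number of index positions, each holding one line per way; that reading, and all names used, are assumptions made for illustration.

```python
# Illustrative arithmetic only, for the embodiment described above.
# Assumption: "512 lines" is read as 512 index positions, each with 4 ways.
CACHE_BYTES = 64 * 1024
LINE_BYTES = 32
WAYS = 4

TOTAL_LINES = CACHE_BYTES // LINE_BYTES   # 2048 lines in the whole array
INDEX_POSITIONS = TOTAL_LINES // WAYS     # 512 positions, i.e. 9 index bits


def set_index(fetch_address):
    """Extract bits [13:5] of the fetch address."""
    return (fetch_address >> 5) & (INDEX_POSITIONS - 1)
```

Nine index bits above the five byte-offset bits occupy bit positions 5 through 13, which agrees with the bits [13:5] recited above.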

[0080] The line of instruction bytes 494 selected by the virtual fetch address 495 is provided on the output of the instruction cache 432 to the instruction buffer 342 as shown in FIG. 4. In one embodiment, one half of the selected line of instruction bytes is provided to the instruction buffer 342 at a time, i.e., 16 bytes are provided during two separate periods each. In the present specification, a cache line or line of instruction bytes may be used to refer to a portion of a line selected within the instruction cache 432 by the fetch address 495, such as a half-cache line or other subdivision thereof.

[0081] The instruction cache 432 also includes a tag array 504 for caching tags. The tag array 504, like the data array 506, is indexed by the same bits of the virtual fetch address 495. Physical address bits are cached in the tag array 504 as physical tags. The physical tags 514 selected by the fetch address 495 bits are provided on the output of the tag array 504.

[0082] The instruction cache 432 also includes a comparator 508 that compares the physical tags 514 with the physical page number 512 provided by the TLB 502 to generate a hit signal 518 for indicating whether the virtual fetch address 495 hit in the instruction cache 432. That is, the hit signal 518 indicates whether the instructions of the task currently being executed by the processor 300 at the fetch address 495 are cached in the data array 506 of the instruction cache 432. The hit signal 518 is a true indication of whether the current task instructions are cached since the instruction cache 432 converts the virtual fetch address 495 to a physical address and uses the physical address to determine a cache hit.
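The hit determination just described may be modeled, purely as a sketch, with dictionaries standing in for the TLB 502 and the tag array 504. The class and attribute names are hypothetical, and the treatment of a TLB miss as "no hit" is an assumption of this sketch rather than something stated in the specification.

```python
# Illustrative model only; names are hypothetical.
class InstructionCacheModel:
    def __init__(self):
        # virtual page number (bits [31:12]) -> 20-bit physical page number
        self.tlb = {}
        # index (bits [13:5]) -> list of physical tags, one per way
        self.tags = {}

    def hit(self, fetch_address):
        """Model of the hit signal: compare physical tags to the TLB output."""
        virtual_page = (fetch_address >> 12) & 0xFFFFF
        physical_page = self.tlb.get(virtual_page)
        if physical_page is None:
            return False  # assumption: a TLB miss is reported as no hit
        index = (fetch_address >> 5) & 0x1FF
        return physical_page in self.tags.get(index, [])
```

Because the comparison uses the translated physical page number, remapping a virtual page to a different physical page changes the outcome for the same virtual fetch address.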

[0083] The operation of the instruction cache 432 as just described is in contrast to the BTAC 402 operation, which determines a hit based only on a virtual address, i.e., the fetch address 495, not on a physical address. A consequence of the distinction in operation is that virtual aliasing may occur such that the BTAC 402 produces an erroneous target address 352, as described below.

[0084] Referring again to FIG. 4, the instruction buffer 342 of FIG. 3 receives the cache line instruction bytes 494 from the instruction cache 432 and buffers the instruction bytes 494 until they are formatted and translated. As mentioned above with respect to the V-stage 308 of FIG. 3, the instruction buffer 342 also stores other information relevant to branch prediction, such as x86 prefix and mod R/M information, and whether an instruction byte is a branch opcode value.

[0085] In addition, the instruction buffer 342 stores a speculatively branched (SB) bit 438 for each instruction byte stored in the instruction buffer 342. If the processor 300 speculatively branches to a speculative target address 352 provided by the BTAC 402 or to a speculative return address 353 provided by the speculative call/return stack 406 based on SBI 454 cached in the BTAC 402, the SB bit 438 is set for an instruction byte indicated by the SBI 454. That is, if the processor 300 speculatively branches based on a presumption that a branch instruction for which SBI 454 is cached in the BTAC 402 is present in the line of instruction bytes 494 provided by the instruction cache 432, the SB bit 438 is set for one of the instruction bytes 494 stored in the instruction buffer 342. In one embodiment, the SB bit 438 is set for the opcode byte of the presumed branch instruction as indicated by the SBI 454.

[0086] Instruction decode logic 436 receives instruction bytes 493 from the instruction buffer 342 in order to decode the instruction bytes 493, including branch instruction bytes, to generate instruction decode information 492. The instruction decode information 492 is used to make branch instruction predictions and to detect and correct erroneous speculative branches. The instruction decode logic 436 provides the instruction decode information 492 to downstream portions of the pipeline 300. In addition, the instruction decode logic 436 generates a next sequential instruction pointer (NSIP) 466 and a current instruction pointer (CIP) 468 when decoding the current instruction. In addition, the instruction decode logic 436 provides instruction decode information 492 to the non-speculative target address calculator 416, the non-speculative call/return stack 414, and the non-speculative branch direction predictor 412. Preferably, the non-speculative call/return stack 414, the non-speculative branch direction predictor 412, and the non-speculative target address calculator 416 reside in the F-stage 312 of the pipeline 300.

[0087] The non-speculative branch direction predictor 412 generates a non-speculative prediction of the direction of a branch instruction 444, i.e., whether the branch will be taken or not taken, in response to the instruction decode information 492 received from the instruction decode logic 436. Preferably, the non-speculative branch direction predictor 412 includes one or more branch history tables for storing a history of resolved directions of executed branch instructions. Preferably, the branch history tables are used in conjunction with decode information of the branch instruction itself provided by the instruction decode logic 436 to predict a direction of conditional branch instructions. An exemplary embodiment of the non-speculative branch direction predictor 412 is described in U.S. patent application Ser. No. 09/434,984 (Docket Number CNTR:1498) HYBRID BRANCH PREDICTOR WITH IMPROVED SELECTOR TABLE UPDATE MECHANISM, having a common assignee and which is hereby incorporated by reference. Logic that ultimately resolves the direction of the branch instruction preferably resides in the E-stage 326 of the pipeline 300.

[0088] The non-speculative call/return stack 414 generates the non-speculative return address 355 of FIG. 3 in response to the instruction decode information 492 received from the instruction decode logic 436. Among other things, the instruction decode information 492 indicates whether the currently decoded instruction is a call instruction, a return instruction, or neither.

[0089] In addition, the instruction decode information 492 includes a return address 488 if the instruction currently being decoded by the instruction decode logic 436 is a call instruction. Preferably, the return address 488 comprises the value of the instruction pointer of the currently decoded call instruction plus the length of the call instruction. The return address 488 is pushed onto the non-speculative call/return stack 414 when the instruction decode information 492 indicates the instruction is a call instruction so that the return address 488 can be provided as non-speculative return address 355 upon subsequent decode of a return instruction by the instruction decode logic 436.
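The push and pop behavior of the non-speculative call/return stack 414 described above may be sketched as follows. The class, the method name, and the string labels for instruction kinds are hypothetical; returning no address when the stack is empty is an assumption of this sketch.

```python
# Illustrative model only; names and the empty-stack behavior are assumptions.
class CallReturnStackModel:
    def __init__(self):
        self._stack = []

    def on_decode(self, kind, instruction_pointer, length):
        """Process one decoded instruction; return a return address or None."""
        if kind == "call":
            # return address = call instruction pointer + call length
            self._stack.append(instruction_pointer + length)
            return None
        if kind == "return" and self._stack:
            return self._stack.pop()
        return None
```

For example, decoding a 5-byte call at 0x1000 pushes 0x1005, which is then supplied when the next return instruction is decoded.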

[0090] An exemplary embodiment of the non-speculative call/return stack 414 is described in U.S. patent application Ser. No. 09/271,591 (Docket Number CNTR:1500) METHOD AND APPARATUS FOR CORRECTING AN INTERNAL CALL/RETURN STACK IN A MICROPROCESSOR THAT SPECULATIVELY EXECUTES CALL AND RETURN INSTRUCTIONS, having a common assignee and which is hereby incorporated by reference.

[0091] The non-speculative target address calculator 416 generates the non-speculative target address 354 of FIG. 3 in response to the instruction decode information 492 received from the instruction decode logic 436. Preferably, the non-speculative target address calculator 416 includes an arithmetic logic unit for calculating a branch target address of PC-relative or direct type branch instructions. Preferably, the arithmetic logic unit adds an instruction pointer and length of the branch instruction to a signed offset comprised in the branch instruction to calculate the target address of PC-relative type branch instructions. Preferably, the non-speculative target address calculator 416 includes a relatively small branch target buffer (BTB) for caching branch target addresses of indirect type branch instructions. An exemplary embodiment of the non-speculative target address calculator 416 is described in U.S. patent application Ser. No. 09/438,907 (Docket Number CNTR:1507) APPARATUS FOR PERFORMING BRANCH TARGET ADDRESS CALCULATION BASED ON BRANCH TYPE, having a common assignee and which is hereby incorporated by reference.
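The PC-relative calculation described above reduces to a single addition. In the sketch below, the function name is hypothetical and the wrap to 32 bits is an assumption based on the 32-bit fetch address of the described embodiment.

```python
# Illustrative sketch only; the 32-bit wraparound is an assumption.
def pc_relative_target(instruction_pointer, length, signed_offset):
    """Target = instruction pointer + instruction length + signed offset."""
    return (instruction_pointer + length + signed_offset) & 0xFFFFFFFF
```

A 2-byte branch at 0x1000 with an offset of -4, for instance, yields a target of 0xFFE, i.e., a short backward branch.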

[0092] The branch prediction apparatus 400 includes the speculative branch target address cache (BTAC) 402. The BTAC 402 is addressed with a fetch address 495 provided on the fetch address bus for indexing a line within the BTAC 402. The BTAC 402 is not integrated with the instruction cache 432, but rather, is separate and distinct from the instruction cache 432, as shown. That is, the BTAC 402 is distinct from the instruction cache 432, both physically and conceptually. The BTAC 402 is physically distinct from the instruction cache 432 in that it is spatially located in a different location within the processor 300 than the instruction cache 432. The BTAC 402 and instruction cache 432 are conceptually distinct in that they are different in size, i.e., in one embodiment they comprise a different number of cache lines. The BTAC 402 and instruction cache 432 are also conceptually distinct in that the instruction cache 432 converts the fetch address 495 to a physical address for determining a hit of a line of instruction bytes; whereas, the BTAC 402 is indexed by the virtual fetch address 495 as a virtual address, without converting to a physical address.

[0093] Preferably, the BTAC 402 resides in the I-stage 302 of the pipeline 300. The BTAC 402 caches target addresses of previously executed branch instructions. When the processor 300 executes a branch instruction, the resolved target address of the branch instruction is cached in the BTAC 402 via update signals 442. The instruction pointer (IP) 1512 (see FIG. 15) of the branch instruction is used to update the BTAC 402, as described below with respect to FIG. 15.

[0094] To generate the cached branch target address 352 of FIG. 3, the BTAC 402 is indexed by the instruction cache 432 fetch address 495 in parallel with the instruction cache 432. The BTAC 402 provides the speculative branch target address 352 in response to the fetch address 495. Preferably, all 32 bits of the fetch address 495 are used to select the speculative target address 352 from the BTAC 402, as will be described in more detail below, primarily with respect to FIGS. 6 through 9. The speculative branch target address 352 is provided to address selection logic 422 comprising a multiplexer 422.

[0095] The multiplexer 422 selects the fetch address 495 from among a plurality of addresses, including the BTAC 402 target address 352, as will be discussed below. The multiplexer 422 output provides the fetch address 495 to the instruction cache 432 and BTAC 402. If the multiplexer 422 selects the BTAC 402 target address 352, then the processor 300 will branch to the BTAC 402 target address 352. That is, the processor 300 will begin fetching instructions from the instruction cache 432 at the BTAC 402 target address 352.

[0096] In one embodiment, the BTAC 402 is smaller than the instruction cache 432. In particular, the BTAC 402 caches target addresses for a smaller number of cache lines than are comprised in the instruction cache 432. A consequence of the BTAC 402 not being integrated with the instruction cache 432, yet using the instruction cache 432 fetch address 495 as an index, is that if the processor 300 branches to the target address 352 generated by the BTAC 402 it does so speculatively. The branch is speculative because there is no certainty that a branch instruction resides in the selected instruction cache 432 line at all, much less the branch instruction for which the target address 352 was cached. A hit in the BTAC 402 only indicates that a branch instruction was previously present in the instruction cache 432 line selected by the fetch address 495. There are at least two reasons there is no certainty a branch instruction is present in the selected cache line.

[0097] A first reason there is no certainty that a branch instruction is in the instruction cache 432 line indexed by the fetch address 495 is that the fetch address 495 is a virtual address; therefore, virtual aliasing may occur. That is, two different physical addresses may alias to the same virtual fetch address 495. A given fetch address 495, which is virtual, may translate to two different physical addresses associated with two different processes or tasks of a multitasking processor such as processor 300. The instruction cache 432 performs virtual to physical translation using the translation lookaside buffer 502 of FIG. 5 in order to provide the correct instruction data. However, the BTAC 402 performs its lookup based on the virtual fetch address 495 without performing virtual to physical address translation. Avoiding virtual to physical address translation by the BTAC 402 is advantageous because it enables the speculative branch to be performed faster than if virtual to physical address translation were performed.

[0098] The operating system performing a task switch provides an example of a situation in which the virtual aliasing condition may occur. After the task switch, the processor 300 will fetch instructions from the instruction cache 432 at a virtual fetch address 495 associated with the new process equal to a virtual fetch address 495 of the old process that includes a branch instruction whose target address 352 is cached in the BTAC 402. The instruction cache 432 will produce the instructions for the new process based on the physical address translated from the virtual fetch address 495, as described above with respect to FIG. 5; however, the BTAC 402 will generate a target address 352 for the old process using only the virtual fetch address 495, thereby causing an erroneous branch. Advantageously, the erroneous speculative branch will only occur the first time the new process instruction is executed because the BTAC 402 target address 352 will be invalidated after the error is discovered, as will be described below with respect to FIG. 10.
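The task switch scenario just described may be illustrated with a toy model in which the BTAC is keyed only by the virtual fetch address. The class, its methods, and the example addresses are hypothetical; the real BTAC 402 organization is described with respect to FIGS. 6 and 7 and is not reproduced here.

```python
# Toy model only: a BTAC keyed purely by virtual fetch address.
# Names and addresses are hypothetical.
class VirtualBtacModel:
    def __init__(self):
        self.entries = {}  # virtual fetch address -> cached target address

    def update(self, fetch_address, target_address):
        self.entries[fetch_address] = target_address

    def lookup(self, fetch_address):
        """Return the cached target on a hit, or None on a miss."""
        return self.entries.get(fetch_address)

    def invalidate(self, fetch_address):
        self.entries.pop(fetch_address, None)


btac = VirtualBtacModel()
btac.update(0x00401000, 0x00402000)    # old process executed a branch here

# After a task switch, the new process fetches at the same virtual address.
stale_target = btac.lookup(0x00401000) # hit, but the target belongs to the old process

# Once the erroneous speculative branch is detected, the entry is invalidated.
btac.invalidate(0x00401000)
second_lookup = btac.lookup(0x00401000)  # miss: the error does not repeat
```

The first lookup after the task switch yields the stale target, while the second lookup misses, which corresponds to the erroneous branch occurring only the first time.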

[0099] Thus, a branch to the BTAC 402 target address 352 is speculative because in some situations the processor 300 will branch to an incorrect target address 352 generated by the BTAC 402 because a branch instruction is not present in the instruction cache 432 at the fetch address 495, due to virtual aliasing, for example. In contrast, the Athlon integrated BTAC/instruction cache 202 of FIG. 2 and the Pentium II/III branch target buffer 134 of FIG. 1 described above are non-speculative in this respect. In particular, the Athlon method is non-speculative since it is presumed virtual aliasing does not occur because the Athlon stores the target address 206 of FIG. 2 alongside the branch instruction bytes 108 themselves. That is, the Athlon BTAC 202 lookup is performed based on a physical address. The Pentium II/III method is non-speculative since the branch target buffer 134 generates a branch target address 136 only after the branch instruction has been fetched from the instruction cache 102 and the instruction decode logic 132 determines that a branch instruction is actually present.

[0100] In addition, the non-speculative target address calculator 416, non-speculative call/return stack 414, and non-speculative branch direction predictor 412 predictions are also non-speculative because they generate branch predictions only after the branch instruction has been fetched from the instruction cache 432 and has been decoded by the instruction decode logic 436, as will be described below.

[0101] It should be understood that although the direction prediction 444 generated by the non-speculative branch direction predictor 412 is “non-speculative,” i.e., made with the certainty that a branch instruction exists in the current instruction stream because the branch instruction has been decoded by the instruction decode logic 436, the non-speculative direction prediction 444 is a “prediction” nevertheless. That is, if the branch instruction is a conditional branch instruction, such as an x86 JCC instruction, the branch may or may not be taken in any given execution of the branch instruction.

[0102] Similarly, the target address 354 generated by the non-speculative target address calculator 416 and the return address 355 generated by the non-speculative call/return stack 414 are non-speculative since they are generated with the certainty that a branch instruction exists in the current instruction stream; but they are still predictions, nevertheless. For example, in the case of an x86 indirect jump through memory, the memory contents may have changed since the last time the indirect jump was executed. Hence, the target address may have changed accordingly. Thus, “non-speculative” in this context is not to be confused with “unconditional” as to branch direction or “certain” as to target address. Similarly, “speculative” in this context is not to be confused with “prediction” or “non-certain” as to branch direction or target address.

[0103] A second reason there is no certainty that the branch instruction is in the instruction cache 432 line indexed by the fetch address 495 is the existence of self-modifying code. Self-modifying code may change the contents of the instruction cache 432, but the change is not reflected in the BTAC 402. Hence, a BTAC 402 hit may occur for a line of the instruction cache 432 that previously included a branch instruction, but which has been modified or replaced by a different instruction.

[0104] The branch prediction apparatus 400 also includes the speculative call/return stack 406. The speculative call/return stack 406 stores speculative target addresses for return instructions. The speculative call/return stack 406 generates the speculative return address 353 of FIG. 3 in response to control signals 483 generated by control logic 404. The speculative return address 353 is supplied to an input of the multiplexer 422. When the multiplexer 422 selects the speculative return address 353 generated by the speculative call/return stack 406, the processor 300 branches to the speculative return address 353.

[0105] The control logic 404 generates control signals 483 to control the speculative call/return stack 406 to provide the speculative return address 353 when the BTAC 402 indicates a return instruction may be present in a line of the instruction cache 432 specified by the fetch address 495. Preferably, the BTAC 402 indicates a return instruction may be present in a line of the instruction cache 432 specified by the fetch address 495 when the selected BTAC 402 entry 602 VALID 702 and RET 706 bits (see FIG. 7) are set and a BTAC 402 HIT signal 452 indicates a hit in the BTAC 402 tag array 614 (see FIG. 6).

[0106] The BTAC 402 generates the HIT signal 452 and speculative branch information (SBI) 454 in response to the fetch address 495. The HIT signal 452 indicates that the fetch address 495 generated a cache tag hit in the BTAC 402, described below with respect to FIG. 6. The SBI 454 is also described more thoroughly below with respect to FIG. 6.

[0107] The SBI 454 includes a BEG 446 signal (branch instruction beginning byte offset within a line in the instruction cache 432) and a LEN 448 signal (branch instruction length). The BEG 446 value, the LEN 448 value and the fetch address 495 are added together by an adder 434 to generate a return address 491. The return address 491 is provided on the adder 434 output to the speculative call/return stack 406 so that the return address 491 can be pushed onto the speculative call/return stack 406. The control logic 404 operates the speculative call/return stack 406 in conjunction with the BTAC 402 via signals 483 to push the return address 491. The return address 491 is pushed only if the selected BTAC 402 entry 602 VALID 702 and CALL 704 bits (see FIG. 7) are set and the HIT signal 452 indicates a hit in the BTAC 402 tag array 614 (see FIG. 6). Operation of the speculative call/return stack 406 will be described in more detail below with respect to FIGS. 8 and 13.
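The adder 434 and the push condition described above may be sketched as follows. The function and parameter names are hypothetical, a Python list stands in for the speculative call/return stack 406, and the wrap to 32 bits is an assumption.

```python
# Illustrative sketch only; names and 32-bit wraparound are assumptions.
def speculative_push(stack, fetch_address, hit, valid, call, beg, length):
    """Push fetch address + BEG + LEN when HIT, VALID and CALL are all set.

    Returns True if a return address was pushed, False otherwise.
    """
    if hit and valid and call:
        return_address = (fetch_address + beg + length) & 0xFFFFFFFF
        stack.append(return_address)
        return True
    return False
```

With a fetch address of 0x1000, a BEG value of 7 and a LEN value of 5, the pushed return address is 0x100C, i.e., the address of the byte following the presumed call instruction.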

[0108] The branch prediction apparatus 400 also includes the control logic 404. The control logic 404 controls multiplexer 422 via control signals 478 to select one of the plurality of address inputs to be the fetch address 495. The control logic 404 also sets the SB bits 438 in the instruction buffer 342 via signal 482.

[0109] The control logic 404 receives the HIT signal 452, the SBI 454, the non-speculative branch direction prediction 444 from the non-speculative branch direction predictor 412, and a FULL signal 486 from the instruction buffer 342.

[0110] The branch prediction apparatus 400 also includes prediction check logic 408. The prediction check logic 408 generates an ERR signal 456, which is provided to the control logic 404 to indicate that an erroneous speculative branch was performed based on a BTAC 402 hit, as described below with respect to FIG. 10. The prediction check logic 408 receives the SB bits 438 from the instruction buffer 342 via signal 484, which is also provided to the control logic 404. The prediction check logic 408 also receives the SBI 454 from the BTAC 402. The prediction check logic 408 also receives instruction decode information 492 from the instruction decode logic 436. The prediction check logic 408 also receives the resolved branch direction DIR 481 produced by the E-stage 326 of FIG. 3.

[0111] The prediction check logic 408 also receives the output 485 of a comparator 489. The comparator 489 compares the speculative target address 352 generated by the BTAC 402 and the resolved target address 356 of FIG. 3 produced by the E-stage 326. The BTAC 402 speculative target address 352 is registered and piped down the instruction pipeline 300 to the comparator 489.

[0112] The prediction check logic 408 also receives the output 487 of a comparator 497. The comparator 497 compares the speculative return address 353 generated by the speculative call/return stack 406 and the resolved target address 356. The speculative return address 353 is registered and piped down the instruction pipeline 300 to the comparator 497.

[0113] The BTAC 402 speculative target address 352 is also registered and piped down the instruction pipeline 300 for comparison with the non-speculative target address calculator 416 target address 354 by a comparator 428. The comparator 428 output 476 is provided to the control logic 404. Similarly, the speculative return address 353 generated by the speculative call/return stack 406 is also registered and piped down the instruction pipeline 300 for comparison with the non-speculative return address 355 by a comparator 418. The comparator 418 output 474 is also provided to the control logic 404.

[0114] The branch prediction apparatus 400 also includes a save multiplexer/register 424. The save mux/reg 424 is controlled by a control signal 472 generated by the control logic 404. The output 498 of the save mux/reg 424 is provided as an input to the multiplexer 422. The save mux/reg 424 receives as inputs its own output 498 and the BTAC 402 speculative target address 352.

[0115] The multiplexer 422 also receives as an input the S-stage 328 branch address 356. The multiplexer 422 also receives as an input the fetch address 495 itself. The multiplexer 422 also receives as an input a next sequential fetch address 499 generated by an incrementer 426, which receives the fetch address 495 and increments it to the next sequential instruction cache 432 line.

[0116] Referring now to FIG. 6, a block diagram of the BTAC 402 of FIG. 4 according to the present invention is shown. In the embodiment shown in FIG. 6, the BTAC 402 comprises a 4-way set-associative cache. The BTAC 402 comprises a data array 612 and a tag array 614. The data array 612 comprises an array of storage elements for storing entries for caching branch target addresses and speculative branch information. The tag array 614 comprises an array of storage elements for storing address tags.

[0117] Each of the data array 612 and tag array 614 is organized into four ways, shown as way 0, way 1, way 2, and way 3. Preferably, each of the data array 612 ways stores two entries for caching a branch target address and speculative branch information, designated A and B. Hence, the data array 612 generates eight entries 602 each time it is read. The eight entries 602 are provided to an 8:2 way select mux 606.

[0118] Each of the data array 612 and tag array 614 is indexed by the instruction cache 432 fetch address 495 of FIG. 4. The lower significant bits of the fetch address 495 select a line within each of the arrays 612 and 614. In one embodiment, each of the arrays comprises 128 lines. Hence, the BTAC 402 is capable of caching up to 1024 target addresses, 2 for each of the 4 ways for each of the 128 lines. Preferably, the arrays 612 and 614 are indexed with bits [11:5] of the fetch address 495.

[0119] The tag array 614 generates a tag 616 for each way. Preferably, each tag 616 comprises 20 bits of virtual address, and each of the four tags 616 is compared with bits [31:12] of the fetch address 495 by a block of comparators 604. The comparators 604 generate the HIT signal 452 of FIG. 4 to indicate whether a hit of the BTAC 402 has occurred based on whether one of the tags 616 matches the most significant bits of the fetch address 495. The HIT signal 452 is provided to the control logic 404 of FIG. 4.
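The indexing and tag comparison described above can be illustrated with a short sketch. It assumes the preferred bit fields stated in the text (index bits [11:5], tag bits [31:12]); the function names and the list-of-tags representation of the four ways are hypothetical and stand in for the comparators 604.

```python
NUM_LINES = 128        # lines per array in the described embodiment
NUM_WAYS = 4           # 4-way set-associative
ENTRIES_PER_WAY = 2    # entries A and B

# Capacity: 128 lines x 4 ways x 2 entries = 1024 target addresses.
CAPACITY = NUM_LINES * NUM_WAYS * ENTRIES_PER_WAY


def btac_index(fetch_address):
    """Bits [11:5] of the fetch address select one of 128 lines."""
    return (fetch_address >> 5) & 0x7F


def btac_tag(fetch_address):
    """Bits [31:12] of the fetch address form the 20-bit virtual tag."""
    return (fetch_address >> 12) & 0xFFFFF


def hit_way(fetch_address, way_tags):
    """Return the number of the first way whose tag matches, or None on a miss.

    way_tags is a hypothetical list holding the four tags read from the
    selected line of the tag array.
    """
    tag = btac_tag(fetch_address)
    for way, stored in enumerate(way_tags):
        if stored == tag:
            return way
    return None


print(CAPACITY, btac_index(0x10000009), hex(btac_tag(0x10000009)))
```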

[0120] In addition, the comparators 604 generate control signals 618 to control the way select mux 606. In response, the way select mux 606 selects the A and B entry, 624 and 626, respectively, of one of the four ways in the line generated by the BTAC 402. The A entry 624 and B entry 626 are provided to an A/B select mux 608 and to the control logic 404. The control logic 404 generates a control signal 622 to control the A/B select mux 608 in response to the HIT 452 signal, entry A 624 and entry B 626, the fetch address 495 and other control signals. In response, the A/B select mux 608 selects one of entry A 624 or entry B 626 as the BTAC 402 target address 352 of FIG. 3 and SBI 454 of FIG. 4.

[0121] Preferably, the BTAC 402 is a single-ported cache. A single-ported cache has the advantage of being smaller, and therefore able to cache more target addresses than a dual-ported cache in the same amount of space. However, a dual-ported cache is contemplated to facilitate simultaneous reads and writes of the BTAC 402. The simultaneous read and write feature of the dual-ported BTAC 402 enables faster updates of the BTAC 402 since the updating writes do not have to wait for reads. The faster updates generally result in a more accurate prediction, since the information in the BTAC 402 is more current.

[0122] In one embodiment, the instruction cache 432 lines comprise 32 bytes each. However, the instruction cache 432 provides a half-cache line of instruction bytes 494 at a time. In one embodiment, each line of the BTAC 402 stores two entries 602, and therefore two target addresses 714, per half-cache line of the instruction cache 432.

[0123] Referring now to FIG. 7, a block diagram of the format of an entry 602 of FIG. 6 of the BTAC 402 of FIG. 4 according to the present invention is shown. The entry 602 comprises the SBI (speculative branch information) 454 of FIG. 4 and a branch target address (TA) 714. The SBI 454 comprises a VALID bit 702, the BEG 446 and LEN 448 of FIG. 4, a CALL bit 704, a RET bit 706, a WRAP bit 708, and branch direction prediction information (BDPI) 712. After the pipeline 300 of FIG. 3 executes a branch, the resolved target address of the branch is cached in the TA field 714, and the SBI 454 obtained from decoding and executing the branch instruction is cached in the SBI 454 field of an entry 602 of the BTAC 402.

[0124] The VALID bit 702 indicates whether the entry 602 may be used for speculatively branching the processor 300 to the associated target address 714. In particular, the VALID bit 702 is initially cleared because the BTAC 402 is empty since no valid target addresses have been cached. The VALID bit 702 is set when the processor 300 executes a branch instruction and the resolved target address and speculative branch information associated with the branch instruction are cached in the entry 602. Subsequently, the VALID bit 702 is cleared if the BTAC 402 makes an erroneous prediction based on the entry 602, as described below with respect to FIG. 10.

[0125] The BEG field 446 specifies the branch instruction beginning byte offset within a line in the instruction cache 432. The BEG field 446 is used to calculate a return address for storage in the speculative call/return stack 406 of FIG. 4 upon detection of a call instruction hitting in the BTAC 402. Additionally, the BEG field 446 is used to determine which, if either, of the entry A 624 or entry B 626 of FIG. 6 of a selected BTAC 402 way should result in a BTAC 402 hit, as will be described below with respect to FIG. 8. Preferably, the branch instruction locations specified by entry A 624 and entry B 626 need not be in any particular location order within the instruction cache 432 line. That is, the entry B 626 branch instruction may be earlier in the instruction cache 432 line than the entry A 624 branch instruction.

[0126] The LEN 448 field specifies the length in bytes of the branch instruction. The LEN field 448 is used to calculate a return address for storage in the speculative call/return stack 406 of FIG. 4 upon detection of a call instruction hitting in the BTAC 402.

[0127] The CALL bit 704 indicates whether the cached target address 714 is associated with a call instruction. That is, if a call instruction was executed by the processor 300 and the target address of the call instruction was cached in the entry 602, then the CALL bit 704 will be set.

[0128] The RET bit 706 indicates whether the cached target address 714 is associated with a return instruction. That is, if a return instruction was executed by the processor 300 and the target address of the return instruction was cached in the entry 602, then the RET bit 706 will be set.

[0129] The WRAP bit 708 is set if the branch instruction bytes span two instruction cache 432 lines. In one embodiment, the WRAP bit 708 is set if the branch instruction bytes span two instruction cache 432 half-lines.

[0130] The BDPI (branch direction prediction information) field 712 comprises a T/NT (taken/not taken) field 722 and a SELECT bit 724. The T/NT field 722 comprises a direction prediction of the branch, i.e., it indicates whether the branch is predicted taken or not taken. Preferably, the T/NT field 722 comprises a two-bit up/down saturating counter, for specifying the four states strongly taken, weakly taken, weakly not taken, and strongly not taken. In another embodiment, the T/NT field 722 comprises a single T/NT bit.
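A two-bit up/down saturating counter of the kind described for the T/NT field 722 can be sketched as follows. The numeric encoding of the four states (0 for strongly not taken through 3 for strongly taken) and the function names are assumptions made for illustration; the text does not specify an encoding.

```python
# Assumed encoding of the four states (hypothetical, for illustration only).
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = 0, 1, 2, 3


def predict_taken(counter):
    """The upper two states predict taken; the lower two predict not taken."""
    return counter >= WEAKLY_TAKEN


def update_counter(counter, resolved_taken):
    """Count up on a taken branch and down on a not-taken branch, saturating."""
    if resolved_taken:
        return min(counter + 1, STRONGLY_TAKEN)
    return max(counter - 1, STRONGLY_NOT_TAKEN)


# A strongly taken branch must be resolved not taken twice before the
# prediction flips to not taken.
c = STRONGLY_TAKEN
c = update_counter(c, False)
print(predict_taken(c))
c = update_counter(c, False)
print(predict_taken(c))
```

The hysteresis shown in the usage lines is the usual motivation for the two-bit form over the single T/NT bit of the alternative embodiment: one anomalous outcome does not immediately reverse the prediction.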

[0131] The SELECT bit 724 is used to select between the BTAC 402 T/NT direction prediction 722 and a direction prediction made by a branch history table (BHT) 1202 (see FIG. 12) external to the BTAC 402, as described with respect to FIG. 12. In one embodiment, if after execution of the branch, the selected predictor (i.e., BTAC 402 or BHT 1202) correctly predicted the direction, the SELECT bit 724 is not updated. However, if the selected predictor incorrectly predicted the direction but the other predictor correctly predicted the direction, the SELECT bit 724 is updated to indicate the non-selected predictor rather than the selected predictor.

[0132] In one embodiment, the SELECT bit 724 comprises a two-bit up/down saturating counter, for specifying the four states strongly BTAC, weakly BTAC, weakly BHT, and strongly BHT. In this embodiment, if after execution of the branch, the selected predictor (i.e., BTAC 402 or BHT 1202) correctly predicted the direction, the saturating counters count toward the selected predictor. If the selected predictor incorrectly predicted the direction but the other predictor correctly predicted the direction, the saturating counters count toward the non-selected predictor.
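The two-bit SELECT update rule can be sketched as below. The state encoding (0 for strongly BTAC through 3 for strongly BHT) and the function names are assumptions for illustration. The text does not say what happens when both predictors are wrong; this sketch assumes the counter is left unchanged in that case.

```python
# Assumed encoding (hypothetical): 0 strongly BTAC, 1 weakly BTAC,
# 2 weakly BHT, 3 strongly BHT.


def selected_predictor(counter):
    return "BTAC" if counter <= 1 else "BHT"


def update_select(counter, btac_correct, bht_correct):
    """Move the counter toward whichever predictor the rule favors."""
    selected = selected_predictor(counter)
    other = "BHT" if selected == "BTAC" else "BTAC"
    correct = {"BTAC": btac_correct, "BHT": bht_correct}

    if correct[selected]:
        toward = selected          # selected predictor was right
    elif correct[other]:
        toward = other             # only the non-selected predictor was right
    else:
        return counter             # both wrong: assumed no change

    if toward == "BTAC":
        return max(counter - 1, 0)
    return min(counter + 1, 3)


# Weakly BTAC, BTAC mispredicts while the BHT is correct: move to weakly BHT.
print(selected_predictor(update_select(1, btac_correct=False, bht_correct=True)))
```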

[0133] Referring now to FIG. 8, a flowchart illustrating operation of the speculative branch prediction apparatus 400 of FIG. 4 according to the present invention is shown. The BTAC 402 of FIG. 4 is indexed by the fetch address 495 of FIG. 4. In response, the BTAC 402 comparators 604 of FIG. 6 generate the HIT signal 452 of FIG. 4 in response to the BTAC 402 tag array 614 virtual tags 616 of FIG. 6. The control logic 404 of FIG. 4 examines the HIT signal 452 to determine whether the fetch address 495 was a hit in the BTAC 402, in step 802.

[0134] If a BTAC 402 hit did not occur, then the control logic 404 does not speculatively branch, in step 822. That is, the control logic 404 controls the multiplexer 422 via control signal 478 of FIG. 4 to select one of the inputs other than the BTAC 402 target address 352 and speculative call/return stack 406 return address 353.

[0135] However, if a BTAC 402 hit did occur, the control logic 404 determines whether the A entry 624 of FIG. 6 is valid, seen and taken, in step 804.

[0136] The control logic 404 determines the entry 624 is “valid” if the VALID bit 702 of FIG. 7 is set. If the VALID bit 702 is set, the line of the instruction cache 432 selected by the fetch address 495 is presumed to contain a branch instruction for which branch prediction information was previously cached in the A entry 624; however, as discussed above, there is no certainty the selected instruction cache 432 line contains a branch instruction.

[0137] The control logic 404 determines the entry 624 is “taken” if the T/NT field 722 of FIG. 7 for entry A 624 indicates the presumed branch instruction direction is predicted taken. In the embodiment of FIG. 12 described below, the control logic 404 determines the entry 624 is “taken” if the selected direction indicator indicates the presumed branch instruction direction is predicted taken.

[0138] The control logic 404 determines the entry 624 is “seen” if the BEG field 446 of FIG. 7 is greater than or equal to the corresponding least significant bits of the fetch address 495. That is, the BEG field 446 is compared with the corresponding least significant bits of the fetch address 495 to determine whether the next instruction fetch location is at or before the location of the branch instruction in the instruction cache 432 corresponding to the A entry 624. For example, assume the A entry 624 BEG field 446 contains a value of 3, yet the lower bits of the fetch address 495 are 8. In this case, the A entry 624 branch instruction could not possibly be branched to by this fetch address 495. Consequently, the control logic 404 will not speculatively branch to the A entry 624 target address 714. This is particularly relevant where the fetch address 495 is the target address of a branch instruction.
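The “seen” test reduces to a single comparison, sketched below. The function name and the 32-byte line size used to extract the low-order bits of the fetch address 495 are assumptions for the example; the comparison itself follows the rule stated above.

```python
def is_seen(beg, fetch_address, line_bytes=32):
    """True if the presumed branch begins at or after the fetch location.

    line_bytes is an assumed line size used to isolate the offset bits.
    """
    fetch_offset = fetch_address & (line_bytes - 1)
    return beg >= fetch_offset


# The example from the text: BEG of 3 with fetch-address low bits of 8.
print(is_seen(3, 0x00000008))
```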

[0139] If the A entry 624 is valid, predicted taken, and is seen, the control logic 404 determines whether the B entry 626 of FIG. 6 is valid, seen and taken, in step 806. The control logic 404 determines whether the B entry 626 is valid, seen and taken in a manner similar to the one described with respect to step 804 for the A entry 624.

[0140] If the A entry 624 is valid, predicted taken, and is seen, but the B entry 626 is not valid, is predicted not taken, or is not seen, the control logic 404 examines the RET field 706 of FIG. 7 to determine whether the A entry 624 has cached return instruction information, in step 812. If the RET bit 706 is not set, the control logic 404 controls A/B mux 608 of FIG. 6 to select entry A 624 and controls multiplexer 422 via control signal 478 to speculatively branch to the BTAC 402 entry A 624 target address 714 provided on target address signal 352, in step 814. Conversely, if the RET bit 706 indicates a return instruction is presumably present in the instruction cache 432 line selected by the fetch address 495, the control logic 404 controls multiplexer 422 via control signal 478 to speculatively branch to the speculative call/return stack 406 return address 353 of FIG. 4, in step 818.

[0141] After speculatively branching during step 814 or step 818, the control logic 404 generates an indication on control signal 482 that a speculative branch was performed in response to the BTAC 402, in step 816. That is, regardless of which of the speculative call/return stack 406 return address 353 or BTAC 402 entry A 624 target address 352 the processor 300 speculatively branched to, the control logic 404 indicates on control signal 482 that a speculative branch was performed. The control signal 482 is used to set the SB bit 438 for a byte of the instruction when it proceeds into the instruction buffer 342 of FIG. 3 from the instruction cache 432. In one embodiment, the control logic 404 uses the BEG 446 field of the entry 602 to set the SB bit 438 for the opcode byte within the instruction buffer 342 associated with the branch instruction whose SBI 454 was presumably cached in the BTAC 402 at the fetch address 495 hitting in the BTAC 402.

[0142] If the A entry 624 is invalid, or is predicted not taken, or is not seen, as determined during step 804, the control logic 404 determines whether the B entry 626 is valid, seen and taken, in step 824. The control logic 404 determines whether the B entry 626 is valid, seen and taken in a manner similar to the one described with respect to step 804 for the A entry 624.

[0143] If the B entry 626 is valid, predicted taken, and is seen, the control logic 404 examines the RET field 706 to determine whether the B entry 626 has cached return instruction information, in step 832. If the RET bit 706 is not set, the control logic 404 controls A/B mux 608 of FIG. 6 to select entry B 626 and controls multiplexer 422 via control signal 478 to speculatively branch to the BTAC 402 entry B 626 target address 714 provided on target address signal 352, in step 834. Conversely, if the RET bit 706 indicates a return instruction is presumably present in the instruction cache 432 line selected by the fetch address 495, the control logic 404 controls multiplexer 422 via control signal 478 to speculatively branch to the speculative call/return stack 406 return address 353, in step 818.

[0144] After speculatively branching during step 834 or step 818, the control logic 404 generates an indication on control signal 482 that a speculative branch was performed in response to the BTAC 402, in step 816.

[0145] If neither the A entry 624 nor the B entry 626 is valid, predicted taken, and seen, the control logic 404 does not speculatively branch, in step 822.

[0146] If both the A entry 624 and the B entry 626 are valid, predicted taken, and seen, the control logic 404 determines which of the presumed branch instructions whose information is cached in the A entry 624 and B entry 626 is the first seen of the valid and taken branch instructions in the instruction cache 432 line instruction bytes 494, in step 808. That is, if both of the presumed branch instructions are seen, valid and taken, the control logic 404 determines which of the presumed branch instructions has the smaller memory address by comparing the BEG 446 fields of the A entry 624 and B entry 626. If the B entry 626 BEG 446 value is smaller than the A entry 624 BEG 446 value, then the control logic 404 proceeds to step 832 to speculatively branch based on the B entry 626. Otherwise, the control logic 404 proceeds to step 812 to speculatively branch based on the A entry 624.
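The decision flow of FIG. 8 described in the preceding paragraphs can be summarized in a short software sketch. It is a behavioral illustration only; the `Entry` class, the function names, the use of a plain value for the top of the speculative call/return stack 406, and the target addresses in the usage lines are all hypothetical. Step numbers in the comments refer to the steps named in the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical software view of the fields of an entry 602 used by FIG. 8."""
    valid: bool
    taken: bool
    beg: int
    ret: bool = False
    target: int = 0


def usable(entry, fetch_offset):
    """Valid, predicted taken, and seen (BEG at or after the fetch offset)."""
    return entry.valid and entry.taken and entry.beg >= fetch_offset


def select_speculative_branch(hit, a, b, fetch_offset, return_stack_top=None):
    """Return (entry name, branch address), or (None, None) for no branch."""
    if not hit:                                   # step 802 -> step 822
        return None, None
    a_ok = usable(a, fetch_offset)                # step 804
    b_ok = usable(b, fetch_offset)                # step 806 or 824
    if a_ok and b_ok:                             # step 808: first seen wins
        name, entry = ("B", b) if b.beg < a.beg else ("A", a)
    elif a_ok:
        name, entry = "A", a
    elif b_ok:
        name, entry = "B", b
    else:
        return None, None                         # step 822
    if entry.ret:                                 # step 812 or 832
        return name, return_stack_top             # step 818
    return name, entry.target                     # step 814 or 834


# Two usable entries; B begins earlier in the line, so B is chosen.
a = Entry(valid=True, taken=True, beg=0x0C, target=0x2000)
b = Entry(valid=True, taken=True, beg=0x02, target=0x3000)
print(select_speculative_branch(True, a, b, fetch_offset=0x00))
```

Whenever the sketch returns a non-empty result, the hardware would also perform step 816, indicating on control signal 482 that a speculative branch was performed.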

[0147] In one embodiment, the speculative call/return stack 406 is not present. Hence, steps 812, 818, and 832 are not performed.

[0148] It may be observed from FIG. 8 that the present invention advantageously provides a means for caching a target address and speculative branch information for multiple branch instructions in a given instruction cache line in a branch target address cache not integrated into the instruction cache. In particular, the caching of the branch instruction location information within the cache line in the BEG field 446 advantageously enables the control logic 404 to determine which of the potentially multiple branch instructions within the cache line to speculatively branch upon without having to pre-decode the cache line. That is, the BTAC 402 predicts the target address considering the possibility that two or more branch instructions may be present in the selected cache line without knowing how many, if any, branch instructions are present in the cache line.

[0149] Referring now to FIG. 9, a block diagram illustrating an example of operation of the speculative branch prediction apparatus 400 of FIG. 4 using the steps of FIG. 8 to select a target address 352 of FIG. 4 according to the present invention is shown. The example shows a fetch address 495 with a value of 0x10000009 indexing the instruction cache 432 and BTAC 402 and also being provided to the control logic 404 of FIG. 4. For simplicity and clarity, the information associated with the multi-way associativity of the instruction cache 432 and BTAC 402, such as the multiple ways and way mux 606 of FIG. 6, is not shown. A line 494 of the instruction cache 432 is selected by the fetch address 495. The line 494 includes an x86 conditional jump instruction (JCC) cached at address 0x10000002 and an x86 CALL instruction cached at address 0x1000000C.

[0150] The example also shows portions of an A entry 602A and a B entry 602B within a line of the BTAC 402 selected by the fetch address 495. Entry A 602A contains cached information associated with the CALL instruction and entry B 602B contains cached information for the JCC instruction. Entry A 602A shows a VALID bit 702A set to 1 to indicate a valid entry A 602A, i.e., that the associated target address 714 and SBI 454 of FIG. 7 are valid. Entry A 602A also shows a BEG field 446A with a value of 0x0C, corresponding to the least significant bits of the instruction pointer address of the CALL instruction. Entry A 602A also shows a T/NT field 722A with a value of Taken, indicating the CALL instruction is predicted Taken. The A entry 602A is provided to the control logic 404 via signals 624 of FIG. 6 in response to the fetch address 495.

[0151] Entry B 602B shows a VALID bit 702B set to 1 to indicate a valid entry B 602B. Entry B 602B also shows a BEG field 446B with a value of 0x02, corresponding to the least significant bits of the instruction pointer address of the JCC instruction. Entry B 602B also shows a T/NT field 722B with a value of Taken, indicating the JCC instruction is predicted Taken. The B entry 602B is provided to the control logic 404 via signals 626 of FIG. 6 in response to the fetch address 495.

[0152] In addition, the BTAC 402 asserts the HIT signal 452 to indicate that the fetch address 495 caused a hit in the BTAC 402. The control logic 404 receives entry A 602A and entry B 602B and generates A/B select signal 622 of FIG. 6 based on the HIT signal 452, the fetch address 495 value, and the two entries 602A and 602B according to the method described in FIG. 8.

[0153] The control logic 404 determines during step 802 that a hit occurred in the BTAC 402 based on the HIT signal 452 being asserted. The control logic 404 next determines during step 804 that entry A 602A is valid based on the VALID bit 702A being set. The control logic 404 also determines during step 804 that entry A 602A is taken, since the T/NT field 722A indicates Taken. The control logic 404 also determines during step 804 that entry A 602A is seen, since the BEG field 446A value of 0x0C is greater than or equal to the corresponding lower bits of the fetch address 495 value of 0x09. Since entry A 602A is valid, taken, and seen, the control logic 404 proceeds to step 806.

[0154] The control logic 404 determines during step 806 that entry B 602B is valid based on the VALID bit 702B being set. The control logic 404 also determines during step 806 that entry B 602B is taken, since the T/NT field 722B indicates Taken. The control logic 404 also determines during step 806 that entry B 602B is not seen, since the BEG field 446B value of 0x02 is less than the corresponding lower bits of the fetch address 495 value of 0x09. Since entry B 602B is not seen, the control logic 404 proceeds to step 812.

[0155] The control logic 404 determines during step 812 that the cached instruction associated with entry A 602A is not a return instruction via a clear RET bit 706 of FIG. 7, and proceeds to step 814. During step 814 the control logic 404 generates a value on the A/B select signal 622 to cause the A/B mux 608 of FIG. 6 to select entry A 602A on signals 624. The selection causes the target address 714 of FIG. 7 of entry A 602A to be selected as target address 352 of FIG. 3 for provision to the fetch address 495 select mux 422 of FIG. 4.

[0156] Hence, as may be seen from the example of FIG. 9, the branch prediction apparatus 400 of FIG. 4 advantageously operates to select the first valid, seen, taken entry 602 of the selected BTAC 402 line for speculatively branching the processor 300 to the associated target address 714 contained therein. The apparatus 400 advantageously accomplishes speculatively branching even if multiple branch instructions are present in the corresponding selected instruction cache 432 line 494 without knowledge of the actual contents of the selected line 494.

[0157] Referring now to FIG. 10, a flowchart illustrating operation of the branch prediction apparatus 400 of FIG. 4 to detect and correct erroneous speculative branch predictions according to the present invention is shown. After an instruction is received from the instruction buffer 342, the instruction decode logic 436 of FIG. 4 decodes the instruction, in step 1002. In particular, the instruction decode logic 436 formats the stream of instruction bytes into a distinct x86 macroinstruction, and determines the length of the instruction and whether the instruction is a branch instruction.

[0158] Next, the prediction check logic 408 of FIG. 4 determines whether the SB bit 438 is set for any of the instruction bytes of the instruction being decoded, in step 1004. That is, the prediction check logic 408 determines whether a speculative branch was previously performed based on a BTAC 402 hit of the currently decoded instruction. If no speculative branch was performed, then no action is taken to correct it.

[0159] If a speculative branch was performed, then the prediction check logic 408 examines the currently decoded instruction to determine whether the instruction is a non-branch instruction, in step 1012. Preferably, the prediction check logic 408 determines whether the instruction is a non-branch instruction for the x86 instruction set.

[0160] If the instruction is not a branch instruction, the prediction check logic 408 asserts the ERR signal 456 of FIG. 4 to indicate the detection of an erroneous speculative branch, in step 1022. In addition, the BTAC 402 is updated via update signal 442 of FIG. 4 to clear the VALID bit 702 of FIG. 7 for the corresponding BTAC 402 entry 602 of FIG. 6. Furthermore, the instruction buffer 342 of FIG. 3 is flushed of the instructions erroneously fetched from the instruction cache 432 because of the erroneous speculative branch.

[0161] If the instruction is not a branch instruction, the control logic 404 next controls multiplexer 422 of FIG. 4 to branch to the CIP 468 generated by the instruction decode logic 436 to correct for the erroneous speculative branch, in step 1024. The branch during step 1024 will cause the instruction cache 432 line including the instruction to be re-fetched and speculatively predicted. However, this time, the VALID bit 702 will be clear for the instruction; consequently, no speculative branch will be performed for the instruction, thereby accomplishing the correction of the previous erroneous speculative branch.

[0162] If it is determined during step 1012 that the instruction is a valid branch instruction, the prediction check logic 408 determines whether the SB bit 438 is set for any of the bytes in the instruction in a non-opcode byte location within the instruction bytes of the decoded instruction, in step 1014. That is, although a byte may contain a valid opcode value for the processor 300 instruction set, the valid opcode value may be in a byte location that is not valid for the instruction format. For an x86 instruction, barring prefix bytes, the opcode byte should be the first byte of the instruction. For example, the SB bit 438 may erroneously be set for a branch opcode value in an immediate data or displacement field of the instruction, or in a mod R/M or SIB byte of an x86 instruction due to a virtual aliasing condition. If the branch opcode byte is in a non-opcode byte location, then steps 1022 and 1024 are performed to correct the erroneous speculative prediction.

[0163] If the prediction check logic 408 determines during step 1012 that the instruction is a valid branch instruction, and determines during step 1014 no SB bits 438 are set for non-opcode bytes, then the prediction check logic 408 determines whether there is a speculative and non-speculative instruction length mismatch, in step 1016. That is, the prediction check logic 408 compares the non-speculative instruction length generated by the instruction decode logic 436 during step 1002 with the speculative LEN 448 field of FIG. 7 generated by the BTAC 402. If the instruction lengths do not match, then steps 1022 and 1024 are performed to correct the erroneous speculative prediction.

[0164] If the prediction check logic 408 determines during step 1012 that the instruction is a valid branch instruction, and determines during step 1014 the SB bit 438 is set only for the opcode byte, and determines during step 1016 the instruction lengths match, then the instruction proceeds down the pipeline 300 until it reaches the E-stage 326 of FIG. 3. The E-stage 326 resolves the correct branch instruction target address 356 of FIG. 3 and also determines the correct branch direction DIR 481 of FIG. 4, in step 1032.

[0165] Next, the prediction check logic 408 determines whether the BTAC 402 erroneously predicted the direction of the branch instruction, in step 1034. That is, the prediction check logic 408 compares the correct direction DIR 481 resolved by the E-stage 326 with the prediction 722 of FIG. 7 generated by the BTAC 402 to determine if an erroneous speculative branch was performed.

[0166] If the BTAC 402 predicted an erroneous direction, the prediction check logic 408 asserts the ERR signal 456 to notify the control logic 404 of the error, in step 1042. In response, the control logic 404 updates the BTAC 402 direction prediction 722 via update signal 442 of FIG. 4 for the corresponding BTAC 402 entry 602 of FIG. 6. In addition, the control logic 404 flushes the processor pipeline 300 of the instructions erroneously fetched from the instruction cache 432 because of the erroneous speculative branch. Next, the control logic 404 controls the multiplexer 422 to select the NSIP 466 of FIG. 4, causing the processor 300 to branch to the next instruction after the branch instruction to correct the erroneous speculative branch, in step 1044.

[0167] If no direction error is detected during step 1034, the prediction check logic 408 determines whether the BTAC 402 or speculative call/return stack 406 erroneously predicted the target address of the branch instruction, in step 1036. That is, if the processor 300 speculatively branched to the BTAC 402 target address 352, then the prediction check logic 408 examines the result 485 of comparator 489 of FIG. 4 to determine whether the speculative target address 352 mismatches the resolved correct target address 356. Alternatively, if the processor 300 speculatively branched to the speculative call/return stack 406 return address 353, then the prediction check logic 408 examines the result 487 of comparator 497 of FIG. 4 to determine whether the speculative return address 353 mismatches the resolved correct target address 356.

[0168] If a target address error is detected during step 1036, the prediction check logic 408 asserts the ERR signal 456 to indicate the detection of an erroneous speculative branch, in step 1052. In addition, the control logic 404 updates the BTAC 402 via update signal 442 with the resolved target address 356 generated during step 1032 for the corresponding BTAC 402 entry 602 of FIG. 6. Furthermore, the pipeline 300 is flushed of the instructions erroneously fetched from the instruction cache 432 because of the erroneous speculative branch. Next, the control logic 404 controls multiplexer 422 of FIG. 4 to branch to the resolved correct target address 356, thereby correcting the previous erroneous speculative branch, in step 1054.
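The checks of FIG. 10 fall into a decode-time group (steps 1004 through 1016) and an execute-time group (steps 1032 through 1054), which can be sketched as two functions. This is a behavioral sketch under stated assumptions: the `Decoded` class, the function names, and the string results are hypothetical, and the hardware performs these checks in different pipeline stages rather than as sequential calls.

```python
from dataclasses import dataclass


@dataclass
class Decoded:
    """Hypothetical summary of the instruction decode information 492."""
    is_branch: bool
    length: int
    sb_on_opcode_only: bool  # False if an SB bit is set on a non-opcode byte


def decode_stage_check(sb_set, decoded, btac_len):
    """Steps 1004 through 1016, performed when the instruction is decoded."""
    if not sb_set:                              # step 1004: nothing to check
        return "no_speculative_branch"
    if not decoded.is_branch:                   # step 1012
        return "invalidate_and_branch_to_cip"   # steps 1022 and 1024
    if not decoded.sb_on_opcode_only:           # step 1014
        return "invalidate_and_branch_to_cip"
    if decoded.length != btac_len:              # step 1016
        return "invalidate_and_branch_to_cip"
    return "check_at_execute"


def execute_stage_check(predicted_taken, resolved_taken,
                        speculative_address, resolved_target):
    """Steps 1034 through 1054, performed once the E-stage resolves the branch."""
    if predicted_taken != resolved_taken:       # step 1034
        return "update_direction_and_branch_to_nsip"    # steps 1042 and 1044
    if speculative_address != resolved_target:  # step 1036
        return "update_target_and_branch_to_resolved"   # steps 1052 and 1054
    return "prediction_correct"


# A non-branch instruction that carries an SB bit is caught at decode.
print(decode_stage_check(True, Decoded(False, 2, True), btac_len=2))
```

In this sketch `speculative_address` stands for whichever address was actually branched to, the BTAC 402 target address 352 or the speculative return address 353, mirroring the two comparators 489 and 497.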

[0169] Referring now to FIG. 11, sample code fragments and a table 1100 illustrating an example of the speculative branch misprediction detection and correction of FIG. 10 according to the present invention are shown. The code fragments comprise a previous code fragment and a current code fragment. For example, the previous code fragment illustrates the code present in the instruction cache 432 of FIG. 4 at a virtual address 0x00000010 prior to a task switch of the processor 300 of FIG. 3. The current code fragment illustrates the code present in the instruction cache 432 at the same virtual address 0x00000010 after the task switch, such as may occur in a virtual aliasing condition.

[0170] The previous code sequence includes an x86 JMP (unconditional jump) instruction at address location 0x00000010. The target address of the JMP is address 0x00001234. The JMP has already been executed; hence, the target address 0x00001234 is already cached in the BTAC 402 of FIG. 4 for address 0x00000010 at the time the current code sequence executes. That is, the target address 714 is cached, the VALID bit 702 is set, the BEG 446, LEN 448, and WRAP 708 fields are populated with appropriate values, and the CALL 704 and RET 706 bits of FIG. 7 are cleared. In this example, it is assumed the T/NT field 722 indicates the cached branch will be taken and the JMP is cached in the A entry 624 of the BTAC 402 line.

[0171] The current code sequence includes an ADD (arithmetic add) instruction at 0x00000010, the same virtual address as the JMP instruction in the previous code sequence. At location 0x00001234 in the current code sequence is a SUB (arithmetic subtract) instruction, and at 0x00001236 is an INC (arithmetic increment) instruction.

[0172] The table 1100 comprises eight columns and six rows. The last seven columns of the first row designate seven clock cycles, 1 through 7. The last five rows of the first column designate the first five stages of the pipeline 300, namely the I-stage 302, B-stage 304, U-stage 306, V-stage 308, and F-stage 312. The remaining cells of the table specify the contents of each of the stages during the various clock cycles while executing the current code sequence.

[0173] During clock cycle 1, the BTAC 402 and instruction cache 432 are accessed. The ADD instruction is shown in I-stage 302. The fetch address 495 of FIG. 4 with a value of 0x00000010 indexes the instruction cache 432 and the BTAC 402 for determining if a speculative branch is necessary according to FIG. 8. In the example of FIG. 11, a BTAC 402 hit will occur for a fetch address 495 value of 0x00000010 as discussed below.

[0174] During clock cycle 2, the ADD instruction is shown in the B-stage 304. This is the second clock of the instruction cache 432 fetch cycle. The tag array 614 provides the tags 616 and the data array 612 provides the entries 602 of FIG. 6, including the target address 714 and SBI 454 of FIG. 7 for each of the entries 602. The comparators 604 of FIG. 6 generate a tag hit on signal 452 of FIG. 4 according to step 802 of FIG. 8 since the JMP of the previous code sequence had been cached after its execution. The comparators 604 also control way mux 606 via signal 618 to select the appropriate way. The control logic 404 examines the SBI 454 of the A entry 624 and B entry 626 and selects the A entry 624 in this example for provision as the target address 352 and SBI 454. The control logic 404 also determines that the entry is valid, taken, seen, and is not a return instruction in this example according to steps 804 and 812.

[0175] During clock cycle 3, the ADD instruction is shown in U-stage 306. The ADD instruction is provided by the instruction cache 432 and latched in the U-stage 306. Because steps 802 through 814 of FIG. 8 are performed during clock cycle 2, the control logic 404 controls multiplexer 422 of FIG. 4 via control signal 478 to select the target address 352 provided by the BTAC 402.

[0176] During clock cycle 4, the ADD proceeds to the V-stage 308, where it is written to the instruction buffer 342. Clock cycle 4 is the speculative branch cycle. That is, the processor 300 begins fetching instructions at the cached target address 352 value 0x00001234 according to step 814 of FIG. 8. That is, the fetch address 495 is changed to address 0x00001234 to accomplish a speculative branch to that address according to FIG. 8. Hence, the SUB instruction, located at address 0x00001234, is shown in the I-stage 302 during clock cycle 4. Additionally, the control logic 404 indicates via signal 482 of FIG. 4 that a speculative branch has been performed. Consequently, an SB bit 438 is set in the instruction buffer 342 corresponding to the ADD instruction according to step 816 of FIG. 8.

[0177] During clock cycle 5, the error in the speculative branch is detected. The ADD instruction proceeds to the F-stage 312. The SUB instruction proceeds to the B-stage 304. The INC instruction, the instruction at the next sequential instruction pointer, is shown in the I-stage 302. The F-stage 312 instruction decode logic 436 of FIG. 4 decodes the ADD instruction and generates the CIP 468 of FIG. 4. The prediction check logic 408 detects via signal 484 that an SB bit 438 associated with the ADD instruction is set according to step 1004. The prediction check logic 408 also detects that the ADD instruction is a non-branch instruction according to step 1012, and subsequently asserts the ERR signal 456 of FIG. 4 according to step 1022 to signify the erroneous speculative branch performed during cycle 4.

[0178] During clock cycle 6, the erroneous speculative branch is invalidated. The instruction buffer 342 is flushed according to step 1022. In particular, the ADD instruction is flushed from the instruction buffer 342. Additionally, the BTAC 402 is updated to clear the VALID bit 702 associated with the entry 602 that caused the erroneous speculative branch according to step 1022. Furthermore, the control logic 404 controls multiplexer 422 to select the CIP 468 as the fetch address 495 during the next cycle.

[0179] During clock cycle 7, the erroneous speculative branch is corrected. The processor 300 begins fetching instructions from the instruction cache 432 at the instruction pointer of the ADD instruction that was being decoded by the instruction decode logic 436 when the error was detected during clock cycle 5. That is, the processor 300 branches to CIP 468 corresponding to the ADD instruction according to step 1024, thereby correcting the erroneous speculative branch performed during clock cycle 4. Hence, the ADD instruction is shown in the I-stage 302 during clock cycle 7. This time, the ADD will proceed down the pipeline 300 and execute.
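
The check performed in clock cycles 5 through 7 of the example may be sketched in software as follows. This is only an illustrative model, not the circuit of FIG. 4: the names BtacEntry and check_non_branch are hypothetical, and the model covers only the non-branch case of steps 1004, 1012, 1022 and 1024.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BtacEntry:
    """Hypothetical model of one BTAC entry: cached target plus VALID bit."""
    target: int
    valid: bool = True


def check_non_branch(sb_bit: bool, is_branch: bool,
                     entry: BtacEntry, cip: int) -> Optional[int]:
    """Return the corrective fetch address, or None if no error is detected.

    Models only the case in which a speculative branch was taken (SB bit set)
    for an instruction that decodes as a non-branch.
    """
    if sb_bit and not is_branch:
        entry.valid = False  # invalidate the entry that caused the bad branch
        return cip           # refetch at the instruction's own pointer (CIP)
    return None


# The FIG. 11 scenario: the ADD at 0x00000010 hit a stale JMP entry.
entry = BtacEntry(target=0x00001234)
redirect = check_non_branch(sb_bit=True, is_branch=False,
                            entry=entry, cip=0x00000010)
```

In this model the stale entry is invalidated and fetch is redirected to 0x00000010, which corresponds to the ADD reappearing in the I-stage during clock cycle 7.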

[0180] Referring now to FIG. 12, a block diagram illustrating an alternate embodiment of the branch prediction apparatus 400 of FIG. 4 including a hybrid speculative branch direction predictor 1200 according to the present invention is shown. It may be readily observed that the more accurate the branch direction prediction of the BTAC 402, the more effective speculative branching to the speculative target address 352 generated by the BTAC 402 is in reducing branch delay penalty. Stated conversely, the less frequently an erroneous speculative branch must be corrected, as described with respect to FIG. 10, the more effective speculative branching to the speculative target address 352 generated by the BTAC 402 is in reducing the processor 300 average branch delay penalty. The direction predictor 1200 comprises the BTAC 402 of FIG. 4, a branch history table (BHT) 1202, exclusive OR logic 1204, global branch history registers 1206 and a multiplexer 1208.

[0181] The global branch history registers 1206 comprise a shift register for storing a global history of branch instruction direction outcomes 1212 for all branch instructions executed by the processor 300 received by the global branch history registers 1206. Each time the processor 300 executes a branch instruction, the DIR 481 bit of FIG. 4 is written into the shift register 1206 with the bit set if the branch direction was taken and the bit clear if the branch direction was not taken. Accordingly, the oldest bit is shifted out of the shift register 1206. In one embodiment, the shift register 1206 stores 13 bits of global history. The storage of global branch history is well known in the art of branch prediction for improving prediction of the outcome of branch instructions that exhibit a high dependency with other branch instructions in a program.

[0182] The global branch history 1206 is provided via signals 1214 to the exclusive OR logic 1204 for performance of a logical exclusive OR operation with the fetch address 495 of FIG. 4. The output 1216 of the exclusive OR logic 1204 is provided as an index to the branch history table 1202. The function performed by the exclusive OR logic 1204 is commonly referred to as a gshare operation in the art of branch prediction.

[0183] The branch history table 1202 comprises an array of storage elements for storing a history of branch direction outcomes for a plurality of branch instructions. The array is indexed by the output 1216 of the exclusive OR logic 1204. When the processor 300 executes a branch instruction, the array element of the branch history table 1202 indexed by the exclusive OR logic 1204 output 1216 is selectively updated via signal 1218 as a function of the resolved branch direction DIR 481.
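
The shift register update and the gshare indexing may be illustrated with the following minimal sketch. The 13-bit history width is taken from the embodiment described above; the table size of 4096 entries, the use of the low-order fetch address bits, and the reduction of the XOR result modulo the table size are assumptions made only for illustration, since the text does not specify which address bits participate. The function names are hypothetical.

```python
HISTORY_BITS = 13                        # width stated for one embodiment
HISTORY_MASK = (1 << HISTORY_BITS) - 1
BHT_ENTRIES = 4096                       # assumed table size for this sketch


def shift_in(history: int, taken: bool) -> int:
    """Shift the resolved direction (DIR) in; the oldest bit falls off."""
    return ((history << 1) | int(taken)) & HISTORY_MASK


def gshare_index(history: int, fetch_address: int) -> int:
    """XOR global history with the fetch address to index the BHT.

    Folding the result into the table by a modulo is an assumption.
    """
    return (history ^ fetch_address) % BHT_ENTRIES


history = 0
for _ in range(HISTORY_BITS):
    history = shift_in(history, True)    # thirteen taken branches in a row
index = gshare_index(history, 0x00000010)
```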

[0184] In one embodiment, each of the storage elements in the branch history table 1202 array comprises two direction predictions: an A and B direction prediction. Preferably, the branch history table 1202 generates the A and B direction predictions on T/NT_A/B 1222 signals as shown, for specifying a direction prediction to be selected against each of the A entry 624 and B entry 626 of FIG. 6 generated by the BTAC 402. In one embodiment, the branch history table 1202 array of storage elements comprises 4096 entries each storing two direction predictions.

[0185] In one embodiment, each of the A and B predictions comprises a single T/NT (taken/not taken) bit. In this embodiment, the single T/NT bit is updated with the value of the DIR bit 481. In another embodiment, each of the A and B predictions comprises a two-bit up/down saturating counter, for specifying the four states strongly taken, weakly taken, weakly not taken, and strongly not taken. In this embodiment, the saturating counters count in the direction indicated by the DIR bit 481.
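
The two-bit up/down saturating counter of the latter embodiment may be sketched as follows. The numeric encoding of the four states (0 for strongly not taken through 3 for strongly taken) and the function names are assumptions for illustration; the text names the states but not their encoding.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


def update_counter(counter: int, taken: bool) -> int:
    """Count up on a taken branch, down on a not-taken branch, saturating."""
    if taken:
        return min(counter + 1, STRONGLY_TAKEN)
    return max(counter - 1, STRONGLY_NOT_TAKEN)


def predict_taken(counter: int) -> bool:
    """Predict taken in either of the two 'taken' states."""
    return counter >= WEAKLY_TAKEN


counter = WEAKLY_NOT_TAKEN
counter = update_counter(counter, True)   # moves to weakly taken
```

The saturation means that a single anomalous outcome in a strongly biased branch does not flip the prediction.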

[0186] The mux 1208 receives the two direction prediction bits T/NT_A/B 1222 from the branch history table 1202 and the T/NT direction prediction 722 of FIG. 7 for each of the A entry 624 and B entry 626 from the BTAC 402. The mux 1208 receives as select control signals the SELECT bit 724 for each of the A entry 624 and B entry 626 from the BTAC 402. The A entry 624 SELECT bit 724 selects from among the two A inputs a T/NT for the A entry 624. The B entry 626 SELECT bit 724 selects from among the two B inputs a T/NT for the B entry 626. The two selected T/NT bits 1224 are provided to the control logic 404 for use in controlling multiplexer 422 via signal 478 of FIG. 4. In the embodiment of FIG. 12, the two selected T/NT bits 1224 are comprised in entry A 624 and entry B 626, respectively, shown in FIG. 6 provided to the control logic 404.
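
The per-entry selection performed by the mux 1208 may be modeled as below. The polarity of the SELECT bit (set meaning "use the branch history table prediction") is an assumption, as the text does not state it, and the function names are hypothetical.

```python
from typing import Tuple


def select_direction(select_bit: bool, bht_taken: bool,
                     btac_taken: bool) -> bool:
    """Choose one T/NT prediction for a single entry.

    Assumed polarity: a set SELECT bit picks the BHT prediction,
    a clear SELECT bit picks the BTAC's own T/NT prediction.
    """
    return bht_taken if select_bit else btac_taken


def select_directions(select_a: bool, select_b: bool,
                      bht_a: bool, bht_b: bool,
                      btac_a: bool, btac_b: bool) -> Tuple[bool, bool]:
    """Apply the selection independently to the A entry and the B entry."""
    return (select_direction(select_a, bht_a, btac_a),
            select_direction(select_b, bht_b, btac_b))


# A entry defers to the BHT; B entry keeps the BTAC prediction.
selected = select_directions(True, False,
                             bht_a=False, bht_b=False,
                             btac_a=True, btac_b=True)
```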

[0187] It may be observed that if the processor 300 branches to the target address 352 generated by the BTAC 402 based, at least in part, on the direction predictions 1222 provided by the branch history table 1202, it does so speculatively. The branch is speculative because, although a hit in the BTAC 402 indicates that a branch instruction was previously present in the instruction cache 432 line selected by the fetch address 495, there is no certainty that a branch instruction resides in the selected instruction cache 432 line, as discussed above.

[0188] It may also be observed that the hybrid speculative branch direction predictor 1200 of FIG. 12 potentially advantageously provides a more accurate branch direction prediction than the BTAC 402 direction prediction 722 alone. In particular, generally speaking, the branch history table 1202 provides a more accurate prediction for branches that are highly dependent upon the history of other branches; whereas, the BTAC 402 provides a more accurate prediction for branches that are not highly dependent upon the history of other branches. The SELECT bits 724 enable a selection of the more accurate predictor for a given branch. Thus, it may be observed that the direction predictor 1200 of FIG. 12 advantageously works in conjunction with the BTAC 402 to enable more accurate speculative branching using the target address 352 provided by the BTAC 402.

[0189] Referring now to FIG. 13, a flowchart illustrating operation of the dual call/return stacks 406 and 414 of FIG. 4 is shown. It is a characteristic of computer programs that subroutines may be called from multiple locations within the program. Consequently, the return address for a return instruction within the subroutine may vary widely. Thus, it has been observed that it is often difficult to predict a return address using a branch target address cache, thereby necessitating the advent of call/return stacks. The dual call/return address stack scheme of the present invention provides the benefits of call/return stacks generally, i.e., more accurate prediction of return addresses than a simple BTAC, in addition to the benefits of the speculative BTAC of the present invention, such as prediction of a branch target address early in the pipeline 300 in order to reduce the branch penalty.

[0190] The BTAC 402 of FIG. 4 is indexed by the fetch address 495 of FIG. 4 and the control logic 404 of FIG. 4 examines the HIT signal 452 to determine whether the fetch address 495 was a hit in the BTAC 402 and examines the VALID bit 702 of the SBI 454 to determine whether the selected BTAC 402 entry 602 is valid, in step 1302. If a BTAC 402 hit did not occur or the VALID bit 702 is not set, then the control logic 404 does not cause the processor 300 to speculatively branch.

[0191] If a valid BTAC 402 hit occurred during step 1302, then the control logic 404 examines the CALL bit 704 of FIG. 7 of the SBI 454 of FIG. 4 to determine whether the cached branch instruction is speculatively, or presumably, a call instruction, in step 1304. If the CALL bit 704 is set, then the control logic 404 controls the speculative call/return stack 406 to push the speculative return address 491, in step 1306. That is, the speculative return address 491 of the presumed call instruction, comprising the sum of the fetch address 495, BEG 446, and LEN 448 of FIG. 4, is saved in the speculative call/return stack 406. The speculative return address 491 is speculative because it is not certain that the line of the instruction cache 432 associated with the fetch address 495 that hit in the BTAC 402 actually contains a call instruction, much less the call instruction for which the BEG 446 and LEN 448 are cached in the BTAC 402. The speculative return address 491, or target address, may be speculatively branched to as provided on return address signal 353 the next time a return instruction is executed, as will be described below with respect to steps 1312 through 1318.

[0192] If the CALL bit 704 is set, the control logic 404 next controls the multiplexer 422 to select the BTAC 402 target address 352 of FIG. 3 in order to speculatively branch to the target address 352, in step 1308.

[0193] If the control logic 404 determines during step 1304 that the CALL bit 704 is not set, then the control logic 404 examines the RET bit 706 of FIG. 7 of the SBI 454 to determine whether the cached branch instruction is speculatively, or presumably, a return instruction, in step 1312. If the RET bit 706 is set, then the control logic 404 controls the speculative call/return stack 406 to pop the speculative return address 353 of FIG. 3 from the top of its stack, in step 1314.

[0194] After popping the speculative return address 353, the control logic 404 controls the multiplexer 422 to select the speculative return address 353 popped off the speculative call/return stack 406 in order to speculatively branch to the return address 353, in step 1316.

[0195] The return instruction proceeds down the pipeline 300 until it reaches the F-stage 312 of FIG. 3 and the instruction decode logic 436 of FIG. 4 decodes the presumed return instruction. If the presumed return instruction is in fact a return instruction, the non-speculative call/return stack 414 of FIG. 4 generates a non-speculative return address 355 of FIG. 3 for the return instruction. The comparator 418 of FIG. 4 compares the speculative return address 353 with the non-speculative return address 355 and provides the result 474 to the control logic 404, in step 1318.

[0196] The control logic 404 examines the comparator 418 result 474 to determine if a mismatch occurred, in step 1324. If the speculative return address 353 and the non-speculative return address 355 do not match, then the control logic 404 controls multiplexer 422 to select the non-speculative return address 355 in order to cause the processor 300 to branch to the non-speculative return address 355, in step 1326.

[0197] If the control logic 404 determines during step 1304 that the CALL bit 704 is not set, and determines during step 1312 that the RET bit 706 is not set, then the control logic 404 controls multiplexer 422 to speculatively branch to the BTAC 402 target address 352 of FIG. 3 as described in steps 814 or 834 of FIG. 8, in step 1322.
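
The flow of FIG. 13 may be summarized by the following sketch. It is a simplified software model under stated assumptions: the class CallReturnStack and the functions speculate and check_return are hypothetical names, the stack is unbounded, and popping an empty stack yields None, none of which is specified in the text. The return address computation (fetch address plus BEG plus LEN) follows step 1306.

```python
from typing import List, Optional


class CallReturnStack:
    """Hypothetical unbounded model of the speculative call/return stack."""

    def __init__(self) -> None:
        self._stack: List[int] = []

    def push(self, address: int) -> None:
        self._stack.append(address)

    def pop(self) -> Optional[int]:
        # Behavior on an empty stack is an assumption of this sketch.
        return self._stack.pop() if self._stack else None


def speculate(hit: bool, valid: bool, call_bit: bool, ret_bit: bool,
              fetch_address: int, beg: int, length: int,
              target: int, stack: CallReturnStack) -> Optional[int]:
    """Return the address to branch to speculatively, or None (no branch)."""
    if not (hit and valid):                           # step 1302
        return None
    if call_bit:                                      # step 1304
        stack.push(fetch_address + beg + length)      # step 1306
        return target                                 # step 1308
    if ret_bit:                                       # step 1312
        return stack.pop()                            # steps 1314, 1316
    return target                                     # step 1322


def check_return(speculative: int, non_speculative: int) -> Optional[int]:
    """Steps 1324 and 1326: on a mismatch, redirect to the correct address."""
    return non_speculative if speculative != non_speculative else None


stack = CallReturnStack()
call_target = speculate(True, True, True, False, 0x100, 6, 5, 0x2000, stack)
return_target = speculate(True, True, False, True, 0x2000, 0, 1, 0, stack)
```

Here the presumed call at offset 6 with length 5 in the line fetched at 0x100 pushes 0x10B, and the later presumed return pops that same address.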

[0198] Thus, it may be observed from FIG. 13 that the operation of the dual call/return stacks of FIG. 4 potentially reduces the branch penalty of call and return instructions. The potential reduction is achieved by enabling the processor 300 to branch earlier in the pipeline for call and return instructions in conjunction with the BTAC 402, while also overcoming the phenomenon that return instructions commonly return to multiple different return addresses by virtue of the fact that subroutines are commonly called from a number of different program locations.

[0199] Referring now to FIG. 14, a flowchart illustrating operation of the branch prediction apparatus 400 of FIG. 4 to selectively override speculative branch predictions with non-speculative branch predictions thereby improving the branch prediction accuracy of the present invention is shown. After an instruction is received from the instruction buffer 342, the instruction decode logic 436 of FIG. 4 decodes the instruction and the non-speculative target address calculator 416, non-speculative call/return stack 414, and non-speculative branch direction predictor 412 of FIG. 4 generate non-speculative branch predictions in response to the instruction decode information 492 of FIG. 4, in step 1402. The instruction decode logic 436 generates a type of the instruction provided in the instruction decode information 492, in step 1402.

[0200] In particular, the instruction decode logic 436 determines whether the instruction is a branch instruction, the length of the instruction, and the type of the branch instruction. Preferably, the instruction decode logic 436 determines whether the branch instruction is a conditional or unconditional type branch instruction, a PC-relative type branch instruction, a return instruction, a direct type branch instruction, or an indirect type branch instruction.

[0201] If the instruction is a branch instruction, the non-speculative branch direction predictor 412 generates the non-speculative direction prediction 444 of FIG. 4. In addition, the non-speculative target address calculator 416 calculates the non-speculative target address 354 of FIG. 3. Finally, if the instruction is a return instruction, the non-speculative call/return stack 414 generates the non-speculative return address 355 of FIG. 3.

[0202] The control logic 404 determines whether the branch instruction is a conditional branch instruction, in step 1404. That is, the control logic 404 determines whether the instruction may be taken or not taken depending upon a condition, such as whether certain flag bits are set, such as a zero flag, carry flag, etc. In the x86 instruction set, the JCC instruction is a conditional type branch instruction. In contrast, the RET, CALL and JMP instructions, for example, are unconditional branch instructions in the x86 instruction set because they always have a direction of taken.

[0203] If the branch is a conditional type branch instruction, the control logic 404 determines whether there is a mismatch between the non-speculative direction 444 predicted by the non-speculative branch direction predictor 412 and the speculative direction 722 of FIG. 7 in the SBI 454 predicted by the BTAC 402, in step 1412.

[0204] If there is a direction prediction mismatch, the control logic 404 determines whether the non-speculative direction prediction 444 is taken or not taken, in step 1414. If the non-speculative direction prediction 444 is not taken, the control logic 404 controls multiplexer 422 to select the NSIP 466 of FIG. 4 in order to branch to the instruction after the current branch instruction, in step 1416. That is, the control logic 404 selectively overrides the speculative BTAC 402 direction prediction. The speculative direction prediction 722 is overridden because the non-speculative direction prediction 444 is generally more accurate.

[0205] If the non-speculative direction prediction 444 is taken, the control logic 404 controls multiplexer 422 to branch to the non-speculative target address 354, in step 1432. Again, the speculative direction prediction 722 is overridden because the non-speculative direction prediction 444 is generally more accurate.

[0206] If the control logic 404 determines during step 1412 that there is not a direction prediction mismatch, and that a speculative branch was performed for the branch instruction (i.e., if the SB bit 438 is set), the control logic 404 determines whether there is a mismatch between the speculative target address 352 and the non-speculative target address 354, in step 1428. If there is a target address mismatch for a conditional type branch, the control logic 404 controls multiplexer 422 to branch to the non-speculative target address 354, in step 1432. The speculative target address prediction 352 is overridden because the non-speculative target address prediction 354 is generally more accurate. If there is not a target address mismatch for a conditional type branch, no action is taken. That is, the speculative branch is allowed to proceed, subject to error correction as described with respect to FIG. 10.

[0207] If during step 1404, the control logic 404 determines the branch instruction is not a conditional type branch, the control logic 404 determines whether the branch instruction is a return instruction, in step 1406. If the branch instruction is a return instruction, the control logic 404 determines whether there is a mismatch between the speculative return address 353 generated by the speculative call/return stack 406 and the non-speculative return address 355 generated by the non-speculative call/return stack 414, in step 1418.

[0208] If there is a mismatch between the speculative return address 353 and the non-speculative return address 355, the control logic 404 controls the multiplexer 422 to branch to the non-speculative return address 355, in step 1422. That is, the control logic 404 selectively overrides the speculative return address 353. The speculative return address 353 is overridden because the non-speculative return address 355 is generally more accurate. If there is not a return address mismatch, no action is taken. That is, the speculative branch is allowed to proceed, subject to error correction as described with respect to FIG. 10. It is noted that steps 1418 and 1422 correspond to steps 1324 and 1326 of FIG. 13, respectively.

[0209] If during step 1406, the control logic 404 determines the branch instruction is not a return instruction, the control logic 404 determines whether the branch instruction is a PC-relative type branch instruction, in step 1408. In the x86 instruction set, a PC-relative type branch instruction is a branch instruction in which a signed offset specified in the branch instruction is added to the current program counter value to compute the target address.

[0210] In an alternate embodiment, the control logic 404 also determines whether the branch instruction is a direct type branch instruction, in step 1408. In the x86 instruction set, a direct type branch instruction is a branch instruction in which the target address is specified in the instruction itself. Direct type branch instructions are also referred to as immediate type branch instructions, since the target address is specified in an immediate field of the instruction.

[0211] If the branch instruction is a PC-relative type branch instruction, the control logic 404 determines whether there is a mismatch between the speculative target address 352 and the non-speculative target address 354, in step 1424. If there is a target address mismatch for a PC-relative type branch, the control logic 404 controls multiplexer 422 to branch to the non-speculative target address 354, in step 1426. The speculative target address prediction 352 is overridden because the non-speculative target address prediction 354 is generally more accurate for a PC-relative type branch. If there is not a target address mismatch for a PC-relative type branch, no action is taken. That is, the speculative branch is allowed to proceed, subject to error correction as described with respect to FIG. 10.

[0212] If during step 1408, the control logic 404 determines the branch instruction is not a PC-relative type branch instruction, no action is taken. That is, the speculative branch is allowed to proceed, subject to error correction as described with respect to FIG. 10. In one embodiment, the non-speculative target address calculator 416 comprises a relatively small branch target buffer (BTB) in the F-stage 312 that caches branch target addresses only for indirect type branch instructions as described above with respect to FIG. 4.

[0213] It has been observed that for indirect type branch instructions, the BTAC 402 prediction is generally more accurate than the relatively small F-stage 312 BTB. Hence, if it is determined that the branch is an indirect type branch instruction, the control logic 404 does not override the BTAC 402 speculative prediction. That is, if a speculative branch was performed due to a BTAC 402 hit as described in FIG. 8 for an indirect type branch instruction, the control logic 404 does not override the speculative branch by branching to the indirect type BTB target address. However, even though for indirect type branches the speculative target address 352 generated by the BTAC 402 is not overridden by the non-speculative target address 354, a target address compare will also be performed later in the pipeline 300 between the speculative target address 352 and the non-speculative target address 356 of FIG. 3 received from the S-stage 328 in order to perform step 1036 of FIG. 10 to detect an erroneous speculative branch.
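
The selective override of FIG. 14 may be condensed into the following decision function. This is a sketch only: the Predictions record, the string labels for the branch type, and the function name override_address are hypothetical, and the model returns the address to which the processor would be redirected, or None when the speculative branch is allowed to proceed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Predictions:
    """Hypothetical bundle of speculative and non-speculative predictions."""
    sb_bit: bool          # a speculative branch was performed
    spec_taken: bool      # BTAC direction prediction
    spec_target: int      # BTAC target address
    spec_return: int      # speculative call/return stack address
    nonspec_taken: bool   # F-stage direction prediction
    nonspec_target: int   # F-stage target address
    nonspec_return: int   # non-speculative call/return stack address
    nsip: int             # next sequential instruction pointer


def override_address(kind: str, p: Predictions) -> Optional[int]:
    """Return the override fetch address, or None for 'no action'."""
    if kind == "conditional":                              # step 1404
        if p.spec_taken != p.nonspec_taken:                # step 1412
            # steps 1414, 1416, 1432
            return p.nonspec_target if p.nonspec_taken else p.nsip
        if p.sb_bit and p.spec_target != p.nonspec_target:  # step 1428
            return p.nonspec_target                        # step 1432
        return None
    if kind == "return":                                   # steps 1406, 1418
        if p.spec_return != p.nonspec_return:
            return p.nonspec_return                        # step 1422
        return None
    if kind == "pc_relative":                              # steps 1408, 1424
        if p.spec_target != p.nonspec_target:
            return p.nonspec_target                        # step 1426
        return None
    return None  # e.g. indirect branches: the BTAC prediction stands


p = Predictions(sb_bit=True, spec_taken=True, spec_target=0x1234,
                spec_return=0x500, nonspec_taken=False,
                nonspec_target=0x1234, nonspec_return=0x500, nsip=0x12)
redirect = override_address("conditional", p)
```

In the usage shown, the two direction predictions disagree and the non-speculative prediction is not taken, so the model redirects to the next sequential instruction pointer.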

[0214] Referring now to FIG. 15, a block diagram illustrating an apparatus for replacing a target address in the BTAC 402 of FIG. 4 according to the present invention is shown. For simplicity and clarity, the information associated with the multi-way associativity of the BTAC 402, such as the multiple ways and way mux 606 of FIG. 6, is not shown. The BTAC 402 data array 612 of FIG. 6 is shown comprising a selected line of the BTAC 402 comprising an entry A 602A and an entry B 602B, which are provided to the control logic 404 via signals 624 and 626 of FIG. 6, respectively. The entry A 602A and entry B 602B include their associated VALID bits 702 of FIG. 7.

[0215] The selected BTAC 402 line also includes an A/B LRU bit 1504 for indicating which of entry A 602A and entry B 602B was least recently used. In one embodiment, each time a BTAC 402 hit occurs on a given target address 714, the A/B LRU bit 1504 is updated to specify the opposite entry of the entry for which the hit occurred. That is, if the control logic 404 proceeds to step 812 of FIG. 8 since a hit occurred on entry A 602A, then the A/B LRU bit 1504 is updated to indicate entry B 602B. Conversely, if the control logic 404 proceeds to step 832 of FIG. 8 since a hit occurred on entry B 602B, then the A/B LRU bit 1504 is updated to indicate entry A 602A. The A/B LRU bit 1504 is also provided to the control logic 404.

[0216] The replacement apparatus also includes a multiplexer 1506. The mux 1506 receives as inputs the fetch address 495 of FIG. 4 and an update instruction pointer (IP) 1512. The mux 1506 selects one of the inputs based on a read/write control signal 1516 provided by the control logic 404. The read/write control signal 1516 is also provided to the BTAC 402. When the read/write control signal 1516 indicates “read”, the mux 1506 selects the fetch address 495 for provision to the BTAC 402 via signal 1514 for reading the BTAC 402. When the read/write control signal 1516 indicates “write”, the mux 1506 selects the update IP 1512 for provision to the BTAC 402 via signal 1514 for writing the BTAC 402 with an updated target address 714 and/or SBI 454 and/or A/B LRU bit 1504 via update signal 442 of FIG. 4.

[0217] When a branch instruction executes and is taken, the target address 714 of the branch instruction and associated SBI 454 are written into, or cached in, a BTAC 402 entry 602. That is, the BTAC 402 is updated with the new target address 714 of the executed branch instruction and associated SBI 454. The control logic 404 must decide which side, A or B, of the BTAC 402 to update for the BTAC 402 line and way selected by the update IP 1512. That is, the control logic 404 must decide whether to replace the entry A 602A or the entry B 602B of the selected line and way. The control logic 404 decides which side to replace as shown in Table 1 below.

TABLE 1
Valid A    Valid B    Replace
0          0          ~LastWritten
0          1          A
1          0          B
1          1          LRU

[0218] Table 1 is a truth table having two inputs, the VALID bit 702 of entry A 602A and the VALID bit 702 of entry B 602B. The output of the truth table is the action for determining the side of the BTAC 402 to replace. As shown, if the A entry 602A is invalid and the B entry 602B is valid, then the control logic 404 replaces the A entry 602A. If the A entry 602A is valid and the B entry 602B is invalid, then the control logic 404 replaces the B entry 602B. If both the A entry 602A and B entry 602B are valid, then the control logic 404 replaces the least recently used entry as specified by the A/B LRU bit 1504 in the line and way selected by the update IP 1512.

[0219] If both the A entry 602A and B entry 602B are invalid, then the control logic 404 must decide which side to replace. One solution is to always write to one side, for example, side A. However, this solution poses a problem illustrated by Code Sequence 1 below.

[0220] 0x00000010 JMP 0x00000014

[0221] 0x00000014 ADD BX, 1

[0222] 0x00000016 CALL 0x12345678

[0223] Code Sequence 1.

[0224] In Code Sequence 1, the three instructions shown are in the same instruction cache 432 line because their instruction pointer addresses are equal except for the lower 4 address bits; accordingly, the JMP and CALL instructions select the same BTAC 402 line and way. Assume in this example both the A entry 602A and the B entry 602B in the BTAC 402 line and way selected by the instruction pointers for the JMP and CALL instructions are invalid when the instructions execute. Using the solution of “always update side A when both entries are invalid”, the JMP instruction would see that both sides are invalid and would update the A entry 602A.

[0225] However, since the CALL instruction is relatively close to the JMP instruction in the program sequence, if the pipeline is relatively long, as in processor 300, a relatively large number of cycles may pass before the VALID bit 702 of entry A 602A is updated. Hence, a high probability exists that the CALL instruction will sample the BTAC 402 before the BTAC 402 is updated by the executed JMP instruction, and in particular, before the entry A 602A VALID bit 702 and BTAC 402 way replacement status for the selected BTAC 402 line are updated by the JMP instruction. Hence, the CALL instruction will see that both sides are invalid and will also update the A entry 602A according to the “always update side A when both entries are invalid” solution. This is problematic, since the target address 714 for the JMP instruction will be needlessly clobbered even though an empty, i.e., invalid, B entry 602B was available for caching the target address 714 of the CALL instruction.

[0226] To solve this problem, as shown in Table 1, if both the A entry 602A and B entry 602B are invalid, then the control logic 404 advantageously selects the side which is the inverse, or not, of a side stored in a global replacement status flag register, LastWritten 1502, comprised in and updated by the replacement apparatus. The LastWritten register 1502 stores an indication of whether side A or B of the BTAC 402 was last written to an invalid entry 602 of the BTAC 402 globally. Advantageously, the method uses the LastWritten register 1502 to avoid the problem illustrated by Code Sequence 1 above as described presently with respect to FIGS. 16 and 17.

[0227] Referring now to FIG. 16, a flowchart illustrating a method of operation of the apparatus of FIG. 15 according to the present invention is shown. FIG. 16 illustrates one embodiment of Table 1 described above.

[0228] When the control logic 404 needs to update a BTAC 402 entry 602, the control logic 404 examines the VALID bit 702 for each of the selected A entry 602A and B entry 602B. The control logic 404 determines if both the A entry 602A and the B entry 602B are valid, in step 1602. If both entries are valid, the control logic 404 examines the A/B LRU bit 1504 to determine whether entry A 602A or entry B 602B was least recently used, in step 1604. If entry A 602A was least recently used, the control logic 404 replaces entry A 602A, in step 1606. If entry B 602B was least recently used, the control logic 404 replaces entry B 602B, in step 1608.

[0229] If the control logic 404 determines during step 1602 that not both entries are valid, it determines whether the A entry 602A is valid and the B entry 602B is invalid, in step 1612. If so, the control logic 404 replaces the B entry 602B, in step 1614. Otherwise, the control logic 404 determines whether the A entry 602A is invalid and the B entry 602B is valid, in step 1622. If so, the control logic 404 replaces the A entry 602A, in step 1624. Otherwise, the control logic 404 examines the LastWritten register 1502, in step 1632.

[0230] If the LastWritten register 1502 indicates the A side of the BTAC 402 was not last written to a selected line and way in which both the A entry 602A and the B entry 602B are invalid, the control logic 404 replaces the A entry 602A, in step 1634. The control logic 404 subsequently updates the LastWritten register 1502 to specify that side A of the BTAC 402 was the last side written to a selected line and way in which both the A entry 602A and the B entry 602B were invalid, in step 1636.

[0231] If the LastWritten register 1502 indicates the B side of the BTAC 402 was not last written to a selected line and way in which both the A entry 602A and the B entry 602B are invalid, the control logic 404 replaces the B entry 602B, in step 1644. The control logic 404 subsequently updates the LastWritten register 1502 to specify that side B of the BTAC 402 was the last side written to a selected line and way in which both the A entry 602A and the B entry 602B were invalid, in step 1646.

[0232] As may be observed, the method of FIG. 16 avoids overwriting the target address of the JMP instruction with the target address of the CALL instruction in Code Sequence 1 above. Assume the LastWritten register 1502 specifies side A when the JMP instruction is executed. The control logic 404 will update the B entry 602B according to FIG. 16 and Table 1 since side B is not the last side written. Additionally, the control logic 404 will update the LastWritten register 1502 to specify the B side. Consequently, when the CALL instruction is executed, the control logic 404 will update the A entry 602A according to FIG. 16, since when the BTAC 402 was sampled, both entries were invalid, and the LastWritten register 1502 specified that side A was not the last side written. Hence, advantageously, the target address for both the JMP and CALL instructions will be cached in the BTAC 402 for subsequent speculative branching thereto.
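The selection steps of FIG. 16 may be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware of the control logic 404: the function name choose_entry_fig16, the string encoding of sides as "A" and "B", and the tuple return value are hypothetical, and the LastWritten register 1502 is modeled as a plain variable passed in and out.

```python
def choose_entry_fig16(valid_a, valid_b, lru_side, last_written):
    """Return (side_to_replace, new_last_written) following FIG. 16.

    valid_a, valid_b: VALID bits sampled for the A and B entries.
    lru_side: "A" or "B", the side the A/B LRU bit marks as least recently used.
    last_written: "A" or "B", the modeled LastWritten value (hypothetical encoding).
    """
    if valid_a and valid_b:
        return lru_side, last_written      # steps 1604-1608: replace the LRU entry
    if valid_a and not valid_b:
        return "B", last_written           # step 1614: fill the invalid B entry
    if valid_b and not valid_a:
        return "A", last_written           # step 1624: fill the invalid A entry
    # Both invalid: pick the side that was NOT last written (step 1632),
    # then record that side as the last written (steps 1636/1646).
    side = "B" if last_written == "A" else "A"
    return side, side


# Code Sequence 1: JMP and CALL both sampled the line while both entries were invalid.
last_written = "A"
jmp_side, last_written = choose_entry_fig16(False, False, "A", last_written)
call_side, last_written = choose_entry_fig16(False, False, "A", last_written)
print(jmp_side, call_side)
```

Under these assumptions the JMP is placed in side B and the CALL in side A, so neither target address overwrites the other.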

[0233] Referring now to FIG. 17, a flowchart illustrating a method of operation of the apparatus of FIG. 15 according to an alternate embodiment of the present invention is shown. The steps of FIG. 17 are identical to the steps of FIG. 16, except that FIG. 17 includes two additional steps. In the alternate embodiment, the control logic 404 updates the LastWritten register 1502 after replacement of an invalid entry even if the other entry is valid.

[0234] Hence, in FIG. 17, after replacing entry B 602B during step 1614, the control logic 404 updates the LastWritten register 1502 to specify side B, in step 1716. Additionally, after replacing entry A 602A during step 1624, the control logic 404 updates the LastWritten register 1502 to specify side A, in step 1726.
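For comparison, the alternate method of FIG. 17 may be sketched in the same illustrative style. The function name choose_entry_fig17 and the "A"/"B" encoding are hypothetical; the only behavioral difference modeled here is the pair of additional steps 1716 and 1726.

```python
def choose_entry_fig17(valid_a, valid_b, lru_side, last_written):
    """Return (side_to_replace, new_last_written) following FIG. 17.

    Same selection as FIG. 16, but the modeled LastWritten value is also
    updated when exactly one entry is invalid (steps 1716 and 1726).
    """
    if valid_a and valid_b:
        return lru_side, last_written   # steps 1604-1608: LastWritten unchanged
    if valid_a and not valid_b:
        return "B", "B"                 # steps 1614 and 1716
    if valid_b and not valid_a:
        return "A", "A"                 # steps 1624 and 1726
    # Both invalid: choose the side that was not last written, then record it.
    side = "B" if last_written == "A" else "A"
    return side, side


# One entry valid, one invalid: the invalid side is filled and recorded.
print(choose_entry_fig17(True, False, "A", "A"))
```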

[0235] Although simulations have revealed no observable performance difference between the embodiments of FIGS. 16 and 17, it is observed that the embodiment of FIG. 16 solves a problem that the embodiment of FIG. 17 does not. The problem is illustrated by Code Sequence 2 below.

[0236] 0x00000010 JMP 0x12345678

[0237] 0x12345678 JMP 0x00000014

[0238] 0x00000014 JMP 0x20000000

[0239] Code Sequence 2.

[0240] The two JMP instructions at instruction pointers 0x00000010 and 0x00000014 are in the same instruction cache 432 line and select the same line in the BTAC 402. The JMP instruction at instruction pointer 0x12345678 is in a different instruction cache 432 line from the other two JMP instructions and selects a different line in the BTAC 402 from the other two JMP instructions. Assume the following conditions when the JMP 0x12345678 instruction executes. The LastWritten register 1502 specifies side B. Both the A entry 602A and the B entry 602B in the BTAC 402 line and way selected by the instruction pointers for the JMP 0x12345678 and JMP 0x20000000 instructions are invalid. The BTAC 402 line and way selected by the instruction pointer for the JMP 0x00000014 instruction indicates the A entry 602A is valid and the B entry 602B is invalid. Assume the JMP 0x20000000 instruction executes before the JMP 0x12345678 instruction updates the BTAC 402. Consequently, the instruction pointers of the JMP 0x12345678 and JMP 0x20000000 instructions select the same way in the same BTAC 402 line.

[0241] According to both FIGS. 16 and 17, when the JMP 0x12345678 executes, the control logic 404 will replace entry A 602A with the target address of the JMP 0x12345678 during step 1634 and update the LastWritten register 1502 to specify side A during step 1636. According to both FIGS. 16 and 17, when the JMP 0x00000014 executes, the control logic 404 will replace entry B 602B with the target address of the JMP 0x00000014 during step 1614. According to FIG. 17, the control logic 404 will update the LastWritten register 1502 to specify side B during step 1716. However, according to FIG. 16, the control logic 404 will not update the LastWritten register 1502; rather, the LastWritten register 1502 will continue to specify side A. Consequently, when the JMP 0x20000000 executes, according to FIG. 17, the control logic 404 will replace the A entry 602A with the target address of the JMP 0x20000000 during step 1634, thereby needlessly clobbering the target address of the JMP 0x12345678. In contrast, according to FIG. 16, when the JMP 0x20000000 executes, the control logic 404 will replace the B entry 602B during step 1644, thereby advantageously leaving the target address of the JMP 0x12345678 in the A entry 602A intact.
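The walk through Code Sequence 2 may be reproduced with a small simulation. The sketch below is a simplified model under stated assumptions: the names choose_entry and run_code_sequence_2 are hypothetical, sides are encoded as "A" and "B", each instruction is represented only by the VALID bits it sampled as described above, and a single flag selects between the FIG. 16 and FIG. 17 update rules.

```python
def choose_entry(valid_a, valid_b, lru_side, last_written, update_on_single_invalid):
    """Return (side_to_replace, new_last_written).

    update_on_single_invalid=False models FIG. 16; True models FIG. 17.
    """
    if valid_a and valid_b:
        return lru_side, last_written
    if valid_a != valid_b:
        side = "B" if valid_a else "A"          # fill whichever entry is invalid
        return side, (side if update_on_single_invalid else last_written)
    side = "B" if last_written == "A" else "A"  # both invalid: not-last-written side
    return side, side


def run_code_sequence_2(update_on_single_invalid):
    """Replay Code Sequence 2 and return the side chosen for each JMP."""
    last_written = "B"  # initial LastWritten value assumed in the text
    # (label, sampled VALID A, sampled VALID B) in execution order
    sampled = [
        ("JMP 0x12345678", False, False),
        ("JMP 0x00000014", True, False),
        ("JMP 0x20000000", False, False),  # sampled before the first JMP's update
    ]
    chosen = {}
    for label, valid_a, valid_b in sampled:
        side, last_written = choose_entry(
            valid_a, valid_b, "A", last_written, update_on_single_invalid
        )
        chosen[label] = side
    return chosen


fig16 = run_code_sequence_2(update_on_single_invalid=False)
fig17 = run_code_sequence_2(update_on_single_invalid=True)
print(fig16)
print(fig17)
```

In this model the first and third JMP instructions share a BTAC 402 line and way, so choosing the same side for both corresponds to the clobbering described above; that occurs with the FIG. 17 rule and not with the FIG. 16 rule.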

[0242] Referring now to FIG. 18, a block diagram illustrating an apparatus for replacing a target address in the BTAC 402 of FIG. 4 according to an alternate embodiment of the present invention is shown. The embodiment of FIG. 18 is similar to the embodiment of FIG. 15. However, in the embodiment of FIG. 18, the A/B LRU bit 1504 and T/NT bits 722 for both entries, shown as T/NT A 722A and T/NT B 722B, are stored in a separate array 1812 rather than in the data array 612.

[0243] The additional array 1812 is dual-ported, whereas the data array 612 is single-ported. Because the A/B LRU bit 1504 and T/NT bits 722 are updated more frequently than the rest of the fields in the entry 602, providing dual-ported access to the more frequently updated fields reduces the likelihood of a bottleneck being created at the BTAC 402 during periods of high traffic. However, since dual-ported cache arrays are larger than single-ported cache arrays and consume more power, the less frequently accessed fields are stored in the single-ported data array 612.

[0244] Referring now to FIG. 19, a block diagram illustrating an apparatus for replacing a target address in the BTAC 402 of FIG. 4 according to an alternate embodiment of the present invention is shown. The embodiment of FIG. 19 is similar to the embodiment of FIG. 15. However, the embodiment of FIG. 19 includes a third entry, entry C 602C, per BTAC 402 line and way. Entry C 602C is provided to the control logic 404 via signals 1928. Advantageously, the embodiment of FIG. 19 supports the ability to speculatively branch to any of three branch instructions cached in a corresponding instruction cache 432 line selected by the fetch address 495, or in one embodiment to any of three branch instructions cached in a corresponding instruction cache 432 half-line.

[0245] In addition, instead of the LastWritten register 1502, the embodiment of FIG. 19 includes a register 1902 that includes both a LastWritten value and a LastWrittenPrev value. When the LastWritten value is updated, the control logic 404 copies the contents of the LastWritten value to the LastWrittenPrev value prior to updating the LastWritten value. Together, the LastWritten and LastWrittenPrev values enable the control logic 404 to determine which of the three entries is the least recently written, as described presently in Table 2 and equations following.

TABLE 2

Valid A  Valid B  Valid C  Replace
0        0        0        LRW
0        0        1        LRWofAandB
0        1        0        LRWofAandC
0        1        1        A
1        0        0        LRWofBandC
1        0        1        B
1        1        0        C
1        1        1        LRU

[0246] Table 2 is similar to Table 1, except that it has three inputs, including the additional VALID bit 702 for entry C 602C. In the equations, “lw” corresponds to the LastWritten value and “lwp” corresponds to the LastWrittenPrev value. In one embodiment, LastWritten and LastWrittenPrev are updated only when all three entries are invalid, analogous to the method of FIG. 16. In an alternate embodiment, LastWritten and LastWrittenPrev are updated any time the control logic 404 writes to an invalid entry, analogous to the method of FIG. 17.
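One way Table 2 could be evaluated is sketched below. This is an assumption-laden illustration rather than the equations referred to above: the helper names least_recently_written, choose_entry_3way and record_write are hypothetical, sides are encoded as "A", "B" and "C", the least recently used side for the all-valid case is simply passed in, and the least recently written (LRW) entry is inferred by excluding first the lw side and then the lwp side from the candidates.

```python
def least_recently_written(candidates, lw, lwp):
    """Pick the candidate side written longest ago, given the two most recent writes.

    Assumed rule: drop the last-written side (lw); if more than one candidate
    remains, also drop the previously written side (lwp). Ties fall to the
    first remaining side in A, B, C order.
    """
    remaining = [s for s in candidates if s != lw]
    if len(remaining) == 1:
        return remaining[0]
    narrowed = [s for s in remaining if s != lwp]
    return narrowed[0] if narrowed else remaining[0]


def choose_entry_3way(valid, lru_side, lw, lwp):
    """Return the side to replace per Table 2.

    valid: dict mapping "A", "B", "C" to the sampled VALID bits.
    """
    invalid = [s for s in "ABC" if not valid[s]]
    if not invalid:
        return lru_side                 # row 1 1 1: LRU
    if len(invalid) == 1:
        return invalid[0]               # rows 0 1 1, 1 0 1, 1 1 0
    return least_recently_written(invalid, lw, lwp)  # LRW rows


def record_write(side, lw, lwp):
    """Copy LastWritten to LastWrittenPrev, then set LastWritten; returns (lw, lwp)."""
    return side, lw


# All three entries invalid; A was written last and B before it, so C is chosen.
side = choose_entry_3way({"A": False, "B": False, "C": False}, "A", lw="A", lwp="B")
lw, lwp = record_write(side, "A", "B")
print(side, lw, lwp)
```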

[0247] Although the present invention and its objects, features, and advantages have been described in detail, other embodiments are encompassed by the invention. For example, the BTAC may be arranged in any number of cache arrangements, including direct-mapped, fully associative, or caches with a different number of ways. Furthermore, the size of the BTAC may be increased or decreased. Also, a fetch address other than the fetch address of the line actually containing the branch instruction being predicted may be used to index the BTAC and branch history table. For example, the fetch address of the previous fetch may be used to reduce the size of a bubble introduced before branching. Additionally, the number of target addresses stored in each way of the cache may be varied. In addition, the size of the branch history table may vary and the number of bits and form of the direction prediction information stored therein may vary as well as the algorithm for indexing the branch history table. Furthermore, the size of the instruction cache may vary and the type of virtual fetch address used to index the instruction cache and BTAC may vary.

[0248] Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

We claim:
 1. An apparatus in a microprocessor for detecting that the microprocessor erroneously branched to a speculative target address that is provided by a branch target address cache (BTAC), the apparatus comprising: a storage element, for storing an indication of whether the microprocessor branched to the speculative target address provided by the BTAC without knowing whether an instruction associated with said indication is a branch instruction; instruction decode logic, configured to receive and decode said instruction subsequent to the microprocessor branching to the speculative target address; and prediction check logic, coupled to said instruction decode logic, for notifying branch control logic that the microprocessor erroneously branched to the speculative target address if said instruction decode logic indicates said instruction is not a branch instruction and said indication indicates the microprocessor branched to the speculative target address.
 2. The apparatus of claim 1, wherein said storage element is in an instruction buffer for storing instructions, including said instruction.
 3. The apparatus of claim 1, wherein said indication indicates that the microprocessor branched to the speculative target address cached in the BTAC without certainty that said instruction decoded by said instruction decode logic is a same instruction for which the BTAC cached the speculative target address.
 4. The apparatus of claim 1, wherein said indication indicates that the microprocessor branched to the speculative target address provided by the BTAC in response to a fetch address that selected a line of instructions in an instruction cache.
 5. The apparatus of claim 4, wherein said indication indicates that the microprocessor branched to the speculative target address in response to said fetch address without certainty whether a previously executed instruction, for which the BTAC cached said target address, is present in said instruction cache line.
 6. The apparatus of claim 1, wherein said instruction decode logic is configured to determine a first instruction length of said instruction.
 7. The apparatus of claim 6, wherein said prediction check logic is configured to notify said branch control logic that the microprocessor erroneously branched to the speculative target address if said first instruction length does not match a second instruction length that is cached in the BTAC and received therefrom.
 8. The apparatus of claim 1, wherein said prediction check logic is configured to notify said branch control logic that the microprocessor erroneously branched to the speculative target address if said indication is associated with a byte of said instruction not defined as a valid opcode byte by an instruction set of the microprocessor.
 9. The apparatus of claim 8, wherein said microprocessor instruction set is an x86 architecture instruction set.
 10. The apparatus of claim 1, wherein the apparatus further comprises: address generation logic, coupled to said instruction decode logic, for generating a correct target address of said instruction; and a comparator, coupled to said address generation logic, for comparing the speculative target address provided by the BTAC and said correct target address of said instruction, and for providing a mismatch indicator to said prediction check logic based on said comparing.
 11. The apparatus of claim 10, wherein said prediction check logic is configured to notify said branch control logic that the microprocessor erroneously branched to the speculative target address if said mismatch indicator indicates the speculative target address and said correct target address of said instruction do not match.
 12. The apparatus of claim 1, wherein the apparatus further comprises: execution logic, operatively coupled to said instruction decode logic, for determining a correct direction of said instruction, said correct direction specifying whether said instruction is taken or not taken, said execution logic providing said correct direction to said prediction check logic.
 13. The apparatus of claim 12, wherein said prediction check logic is configured to notify said branch control logic that the microprocessor erroneously branched to the speculative target address if said correct direction indicates said instruction is not taken.
 14. An apparatus in a microprocessor for detecting that the microprocessor erroneously speculatively branched to a target address that is provided by a speculative branch target address cache (BTAC), the apparatus comprising: a storage element, for storing an indication of whether the microprocessor speculatively branched to the target address provided by the BTAC based on an instruction cache fetch address without first determining whether a branch instruction is present in a line of instruction bytes in the instruction cache selected by said fetch address; instruction decode logic, configured to receive and decode said instruction bytes in said instruction cache line subsequent to the microprocessor speculatively branching to the target address, said instruction decode logic further configured to indicate whether said line includes a branch instruction; and prediction check logic, coupled to said instruction decode logic, for providing an error signal to branch control logic if said indication indicates the microprocessor speculatively branched to the target address and said instruction decode logic indicates said line does not include a branch instruction.
 15. The apparatus of claim 14, wherein the target address is provided by a speculative call/return stack in the microprocessor rather than the speculative BTAC in response to an indication cached in the BTAC that said line of instruction bytes includes a return instruction.
 16. A microprocessor for detecting and correcting an erroneous speculative branch, comprising: an instruction cache, for providing a line of instruction bytes selected by a fetch address, said fetch address provided to said instruction cache on an address bus; a speculative branch target address cache (BTAC), coupled to said address bus, for providing a speculative target address of a previously executed branch instruction in response to said fetch address whether or not said previously executed branch instruction is present in said line; control logic, coupled to said BTAC, configured to control a multiplexer to select said speculative target address as said fetch address during a first period; and prediction check logic, coupled to said BTAC, configured to detect that said control logic controlled said multiplexer to select said speculative target address erroneously; wherein said control logic is further configured to control said multiplexer to select a correct address as said fetch address during a second period in response to said prediction check logic detecting said erroneous selection.
 17. The microprocessor of claim 16, wherein said second period is subsequent to said first period.
 18. The microprocessor of claim 16, further comprising: instruction decode logic, configured to receive and decode said instruction bytes and to specify to said prediction check logic whether a branch instruction is present in said instruction bytes.
 19. The microprocessor of claim 18, wherein said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously comprises said prediction check logic determining that a branch instruction is not present in said instruction bytes.
 20. The microprocessor of claim 16, further comprising: branch target address generation logic, configured to receive said line of instruction bytes and to generate an instruction pointer of an instruction comprised in said line of instruction bytes; wherein said correct address comprises said instruction pointer of said instruction.
 21. The microprocessor of claim 20, wherein said instruction is comprised in said line of instruction bytes at a location of said previously executed branch instruction in said line.
 22. The microprocessor of claim 21, wherein said location is cached in said BTAC.
 23. The microprocessor of claim 16, further comprising: branch target address generation logic, configured to receive said line of instruction bytes and to generate a correct branch target address of a branch instruction comprised in said line of instruction bytes based on execution of said branch instruction comprised in said line of instruction bytes; wherein said correct address comprises said correct branch target address.
 24. The microprocessor of claim 23, wherein said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously comprises said prediction check logic determining that said correct branch target address and said speculative target address do not match.
 25. The microprocessor of claim 23, wherein said branch instruction is comprised in said line of instruction bytes at a location of said previously executed branch instruction in said line.
 26. The microprocessor of claim 25, wherein said location is cached in said BTAC.
 27. The microprocessor of claim 16, further comprising: execution logic, configured to receive said line of instruction bytes and to generate a correct direction of a branch instruction comprised in said line of instruction bytes, said correct direction generated based on execution of said branch instruction comprised in said line of instruction bytes.
 28. The microprocessor of claim 27, wherein said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously comprises said prediction check logic determining that said correct direction indicates said branch instruction comprised in said line of instruction bytes is not taken.
 29. The microprocessor of claim 27, wherein said branch instruction is comprised in said line of instruction bytes at a location of said previously executed branch instruction in said line.
 30. The microprocessor of claim 29, wherein said location is cached in said BTAC.
 31. The microprocessor of claim 16, further comprising: branch target address generation logic, configured to receive said line of instruction bytes and to generate an instruction pointer of a next instruction after an instruction comprised in said line of instruction bytes at a location of said previously executed branch instruction in said line; wherein said correct address comprises said instruction pointer of said next instruction after said instruction.
 32. The microprocessor of claim 31, wherein said location is cached in said BTAC.
 33. The microprocessor of claim 16, further comprising: instruction decode logic, configured to receive and decode said line of instruction bytes and to specify a length of an instruction comprised in said line of instruction bytes, said instruction being at a location of said previously executed branch instruction in said line.
 34. The microprocessor of claim 33, wherein said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously comprises said prediction check logic determining that said length of said instruction does not match an instruction length cached in said speculative BTAC for said previously executed branch instruction.
 35. The microprocessor of claim 34, further comprising: branch target address generation logic, configured to receive said instruction and to generate an instruction pointer of said instruction; wherein said correct address comprises said instruction pointer of said instruction.
 36. The microprocessor of claim 16, further comprising: instruction decode logic, configured to receive and decode said instruction and to specify which of a plurality of bytes comprising said instruction is an opcode byte.
 37. The microprocessor of claim 36, wherein said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously comprises said prediction check logic determining that said control logic controlled said multiplexer to select said speculative target address based on a byte of said instruction other than said opcode byte specified by said instruction decode logic.
 38. The microprocessor of claim 16, further comprising: branch target address generation logic, configured to receive said line of instruction bytes and to generate an instruction pointer of an instruction comprised in said line of instruction bytes, said instruction located at a location of said previously executed branch instruction in said line; wherein said correct address comprises said instruction pointer of said instruction.
 39. The microprocessor of claim 16, wherein an entry of said speculative BTAC caching said speculative target address is invalidated in response to said prediction check logic detecting said erroneous selection.
 40. The microprocessor of claim 16, wherein said speculative BTAC is updated with a direction prediction associated with said previously executed branch instruction, said speculative BTAC being updated in response to said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously.
 41. The microprocessor of claim 16, wherein said speculative target address is updated in said speculative BTAC in response to said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously.
 42. The microprocessor of claim 16, wherein said prediction check logic comprises an error output, coupled to said control logic, for notifying said control logic of said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously.
 43. The microprocessor of claim 16, wherein a plurality of pipeline stages of the microprocessor are flushed in response to said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously.
 44. The microprocessor of claim 16, further comprising: an instruction buffer, coupled to said instruction cache, for buffering said line of instruction bytes; wherein said instruction buffer is flushed in response to said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative target address erroneously.
 45. The microprocessor of claim 16, wherein said speculative BTAC and said instruction cache are accessed substantially in parallel.
 46. A method for recovering from an erroneous branch to a speculative target address, the method comprising: generating a speculative target address for a branch instruction that is presumed present in an instruction cache line selected by a fetch address; branching to said speculative target address whether or not said presumed branch instruction is present in said instruction cache line; generating a correct target address of the presumed branch instruction subsequent to said generating said speculative target address; determining if said speculative target address matches said correct target address; and branching to said correct target address if said speculative target address does not match said correct target address.
 47. The method of claim 46, further comprising: storing an indication of whether said branching to said speculative target address occurred, prior to said determining; and said branching to said correct target address only if said indication indicates said branching to said speculative target address occurred.
 48. The method of claim 46, wherein said generating said correct target address comprises calculating said correct target address using instruction bytes of the presumed branch instruction.
 49. The method of claim 46, further comprising: updating an entry containing said speculative target address in a branch target address cache with said correct target address if said speculative target address does not match said correct target address.
 50. A method for recovering from an erroneous branch to a speculative target address for a branch instruction, the branch instruction presumably present in a line of instructions, the line of instructions provided by an instruction cache in response to a fetch address, the speculative target address speculatively generated by a branch target address cache (BTAC) in response to the fetch address, the method comprising: decoding the presumed branch instruction subsequent to the BTAC speculatively generating the speculative target address; determining if the presumed branch instruction is a non-branch instruction in response to said decoding; and branching to an instruction pointer of the presumed branch instruction if the presumed branch instruction is a non-branch instruction.
 51. The method of claim 50, further comprising: calculating said instruction pointer of the presumed branch instruction in response to said decoding.
 52. The method of claim 50, further comprising: invalidating an entry containing the speculative target address in the BTAC if the presumed branch instruction is a non-branch instruction.
 53. The method of claim 52, wherein said invalidating is performed prior to said branching to said instruction pointer.
 54. A method for recovering from an erroneous branch to a speculative target address, the speculative target address associated with a branch instruction that is presumably present in a cache line selected by a fetch address, the speculative target address provided by a branch target address cache (BTAC) in response to the fetch address, the method comprising: decoding the presumed branch instruction subsequent to the BTAC providing the speculative target address; determining a length of the presumed branch instruction; and branching to an instruction pointer of the presumed branch instruction if said length of the presumed branch instruction does not match an instruction length speculatively provided by the branch target address cache.
 55. The method of claim 54, further comprising: invalidating an entry containing the speculative target address in the BTAC if said length of the presumed branch instruction does not match said instruction length speculatively provided by the branch target address cache.
 56. The method of claim 55, wherein said invalidating is performed prior to said branching to said instruction pointer.
 57. A method for recovering from an erroneous branch to a speculative target address, the method comprising: generating a speculative target address of a branch instruction that is presumed present in an instruction cache line selected by a fetch address; generating a speculative direction prediction of the presumed branch instruction; branching to said speculative target address whether or not said presumed branch instruction is present in said instruction cache line; generating a correct direction of the presumed branch instruction subsequent to said generating said speculative direction prediction; determining if said correct direction is not taken; and branching to an instruction pointer of a next instruction after the presumed branch instruction if said correct direction is not taken.
 58. The method of claim 57, further comprising: updating said speculative direction prediction in a branch target address cache in response to said correct direction if said correct direction is not taken.
 59. An apparatus in a microprocessor for detecting an erroneous branch to a speculative return address that is provided by a speculative call/return stack, the apparatus comprising: a storage element, for storing an indication of whether the microprocessor branched to the speculative return address without knowing whether or not an instruction associated with said indication is a branch instruction; instruction decode logic, configured to receive and decode said instruction subsequent to the microprocessor branching to the speculative return address; and prediction check logic, coupled to said instruction decode logic, for notifying branch control logic that the microprocessor erroneously branched to the speculative return address if said instruction decode logic indicates that said instruction is not a branch instruction and said indication indicates that the microprocessor branched to the speculative return address.
 60. A microprocessor for detecting and correcting an erroneous speculative branch, comprising: an instruction cache, for providing a line of instruction bytes selected by a fetch address; a speculative call/return stack, for providing a speculative return address of a previously executed branch instruction in response to said fetch address, said speculative return address provided whether or not said previously executed branch instruction is present in said line of instruction bytes; control logic, coupled to said speculative call/return stack, configured to control a multiplexer to select said speculative return address to be said fetch address during a first period; and prediction check logic, coupled to said control logic, configured to detect that said control logic controlled said multiplexer to select said speculative return address erroneously; wherein said control logic is further configured to control said multiplexer to select a correct address to be said fetch address during a second period, said control logic selecting said correct address in response to said prediction check logic detecting that said control logic controlled said multiplexer to select said speculative return address erroneously.
 61. A method in a microprocessor for recovering from an erroneous branch to a speculative target address of a presumed branch instruction, the method comprising: providing a speculative target address in response to an instruction cache fetch address; producing an instruction cache line in response to said instruction cache fetch address; decoding an instruction from said instruction cache line subsequent to said providing said speculative target address; wherein said decoding is performed for a first time by the microprocessor for said instruction; branching to said speculative target address prior to said decoding; and branching to a correct target address of said instruction subsequent to said branching to said speculative target address in response to said decoding.
 62. A method for recovering from an erroneous branch to a speculative target address, the method comprising: providing a speculative target address for a branch instruction that is presumed present in an instruction cache line that is selected by a fetch address; branching to said speculative target address whether or not said presumed branch instruction is present in said instruction cache line; and correcting an erroneous branch if said presumed branch instruction is not present in said instruction cache line.
 63. A branch apparatus in a microprocessor for detecting when the microprocessor erroneously branches to a speculative target address, the speculative target address being provided by a branch target address cache (BTAC), the apparatus comprising: a branch hit indicator, provided to indicate when the microprocessor branches to the speculative target address, said branch hit indicator provided whether or not an instruction associated with said branch hit indicator is a branch instruction; instruction decode logic, configured to receive and decode said instruction and to specify whether said instruction is a branch instruction; and prediction check logic, coupled to said instruction decode logic, for determining that the microprocessor erroneously branched to the speculative target address; wherein the microprocessor erroneously branched to the speculative target address when said instruction decode logic specifies that said instruction is not a branch instruction, and said branch hit indicator indicates that the microprocessor branched to the speculative target address. 
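By way of illustration only, the check recited in claims 59 and 63, together with the redirect recited in claims 60 and 61, can be sketched in software. The sketch is a behavioral model under stated assumptions, not the claimed hardware: the function names `erroneously_branched` and `redirect_address` and the parameter `correct_address` are hypothetical, and how the correct address is derived is left to the caller because the claims above do not specify it.

```python
from typing import Optional


def erroneously_branched(hit_indicator: bool, is_branch: bool) -> bool:
    """Model of the prediction check logic (hypothetical name).

    hit_indicator: stored indication that the microprocessor branched
                   to the speculative address before decoding.
    is_branch:     result reported by the instruction decode logic.

    An error is flagged only when a speculative branch was performed
    and the decoded instruction turns out not to be a branch.
    """
    return hit_indicator and not is_branch


def redirect_address(hit_indicator: bool,
                     is_branch: bool,
                     correct_address: int) -> Optional[int]:
    """Model of the second-period fetch address selection (hypothetical).

    Returns the address the control logic would select to correct the
    error, or None when no correction is needed. The value of
    correct_address is an assumption supplied by the caller.
    """
    if erroneously_branched(hit_indicator, is_branch):
        return correct_address
    return None


if __name__ == "__main__":
    # Speculative branch performed, but the instruction is not a branch.
    print(redirect_address(True, False, 0x1004))
    # Speculative branch performed on an actual branch instruction.
    print(redirect_address(True, True, 0x1004))
```

In this model, no redirect occurs when the hit indicator is clear, which mirrors the claims' requirement that both conditions hold before an erroneous branch is reported.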